Plugging a speech recognition engine

Olympus allows running several speech recognition engines in parallel (for example with different acoustic models, language models, or entirely different engines). Audio input processing and the aggregation of the hypotheses from all engines are performed by the AudioServer module. Each engine is a separate process that communicates with AudioServer through sockets.

AudioServer captures audio from the audio device and streams it to the different engines. Each engine then processes it and must return a partial recognition hypothesis (i.e. what it has recognized so far in the current utterance). Once the utterance has been endpointed (the endpointing decision is made by the Interaction Manager), each engine is informed and must return a final hypothesis for the utterance. Both partial and final hypotheses are usually rich Galaxy structures which contain not only the hypothesized word sequence for the utterance but also information such as confidence scores and time alignment.

Communication from the AudioServer to each engine follows a specific protocol where commands are sent as plain text through the socket. Each command consists of a command name followed by zero or more arguments, all separated by whitespace, and terminated by a newline. The special command engine_proc_raw announces audio data. It is followed by one text argument indicating the number of bytes in the incoming buffer, and the command line is immediately followed by the audio data itself (unheadered, PCM, usually 8kHz, 16 bits, mono). The valid commands are as follows:

  • engine_new_session <log_file_path> <start_timestamp>

Indicates that the engine should be ready to start a new dialog. Depending on the engine, this can mean resetting some state information (e.g. which LM is used for decoding). It also indicates a new path for the log file for this engine (the engine should output log information to a separate log file for each session). The first argument indicates the path of the log file, typically of the form '..\Logs\MySystem\20081125\000\MySystem-20081125-000' (where MySystem is the name of the system, 20081125 is the date, and 000 is the session number). The engine is expected to append a unique engine name and '.txt' to the path to generate its log file name (e.g. '..\Logs\MySystem\20081125\000\MySystem-20081125-000-sphinx_male.txt'). The second argument specifies the initial absolute timestamp of the session as a 64-bit integer (on Windows, this is obtained through the High Performance Timer's QueryPerformanceCounter function). All time alignment information (e.g. word alignment within utterances) must be given relative to this initial timestamp. For example, if the hypothesized text of the utterance is FIVE PM, the alignment for FIVE should give the time elapsed from the beginning of the session (given by the start_timestamp argument) to the beginning of the word and to its end, and likewise for PM.
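As a minimal sketch of how an engine might handle this command, the Python snippet below parses the command line, derives the engine's log file name, and converts an absolute time to a session-relative one. The function names and the use of Python are illustrative assumptions, not part of Olympus. The sketch also assumes that the timestamp is the last whitespace-separated token on the line and that alignment times are expressed in the same units as start_timestamp.

```python
def parse_new_session(line: str, engine_name: str) -> tuple[str, int]:
    """Parse 'engine_new_session <log_file_path> <start_timestamp>'.

    Returns (engine_log_file, start_timestamp). Hypothetical helper.
    """
    command, rest = line.strip().split(None, 1)
    if command != "engine_new_session":
        raise ValueError(f"unexpected command: {command}")
    # Assumption: the timestamp is the last token, so split from the right.
    log_prefix, start = rest.rsplit(None, 1)
    # Append the unique engine name and '.txt' to build the log file name.
    return f"{log_prefix}-{engine_name}.txt", int(start)


def relative_to_session(absolute_time: int, start_timestamp: int) -> int:
    """Express an absolute counter value relative to the session start."""
    return absolute_time - start_timestamp


# Example command line, using the path format described above.
line = r"engine_new_session ..\Logs\MySystem\20081125\000\MySystem-20081125-000 1000" + "\n"
log_file, start = parse_new_session(line, "sphinx_male")
```

Here log_file ends in '-sphinx_male.txt', and relative_to_session(1500, start) gives the offset that a word starting at absolute time 1500 would be reported with.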

  • engine_begin_utt <utt_id>

Indicates that a new utterance is starting. The argument specifies a string identifying this utterance (usually something like 000, 001,...). The engine should be ready to receive audio data after this command.

  • engine_cancel_utt

Indicates that the current utterance should be terminated and that the AudioServer does not expect any results from the engine.

  • engine_end_utt

Indicates that the current utterance should be terminated and that AudioServer will ask for a final hypothesis.

  • engine_set_lm <lm_name>

Indicates the name of the LM/grammar to use from now on. This allows for context-dependent language models. If the engine does not support dynamic LM switching, this command should be ignored.

  • engine_proc_result

Asks the engine for a final hypothesis for the last utterance. This command is always sent after engine_end_utt.

  • engine_proc_partial

Asks the engine for a partial hypothesis for the current utterance. This command is always sent after engine_begin_utt and before engine_end_utt.

  • engine_proc_raw <num_bytes>

Indicates an incoming block of raw audio data. The argument specifies the number of bytes to read as audio data. Once that many bytes have been read, the engine should be ready to receive new commands from the socket.
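Reading the audio block could look roughly like the following Python sketch. It assumes the engine reads from a buffered binary file object (for instance one obtained with socket.makefile("rb")), so that reading a text line and then a fixed number of bytes can be mixed safely. The helper names read_exact and read_audio_block are hypothetical, and an in-memory io.BytesIO stands in for the socket.

```python
import io


def read_exact(stream, num_bytes: int) -> bytes:
    """Read exactly num_bytes from a binary stream or raise EOFError."""
    chunks = []
    remaining = num_bytes
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"stream closed with {remaining} bytes missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_audio_block(stream) -> bytes:
    """Read one 'engine_proc_raw <num_bytes>' line and its audio payload."""
    parts = stream.readline().split()
    if len(parts) != 2 or parts[0] != b"engine_proc_raw":
        raise ValueError(f"expected engine_proc_raw, got {parts!r}")
    return read_exact(stream, int(parts[1]))


# Stand-in for the socket: four bytes of audio followed by another command.
demo = io.BytesIO(b"engine_proc_raw 4\n\x01\x02\x03\x04engine_end_utt\n")
audio = read_audio_block(demo)
next_line = demo.readline()
```

The loop in read_exact matters because a single read may return fewer bytes than requested. Consuming exactly num_bytes leaves the stream positioned at the start of the next command line.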

  • engine_restart

Requests the engine to reinitialize itself, reloading all its models. For example, this can be used to modify the acoustic model, language model, and/or dictionary during a dialog and have the engine take the changes into account.
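The ordering constraints stated for the utterance commands (partial hypotheses only between engine_begin_utt and engine_end_utt, final hypotheses only after engine_end_utt) can be sketched as a small state tracker. This is an illustrative Python sketch, not Olympus code. The class name and state labels are hypothetical, and it additionally assumes that audio blocks and engine_cancel_utt only arrive inside an utterance and that engine_new_session, engine_set_lm and engine_restart may arrive at any time.

```python
class CommandOrderChecker:
    """Tracks utterance state and rejects commands that arrive out of order."""

    def __init__(self):
        self.state = "idle"  # one of: "idle", "in_utt", "ended"

    def accept(self, command: str) -> None:
        if command == "engine_begin_utt":
            self.state = "in_utt"
        elif command in ("engine_proc_raw", "engine_proc_partial"):
            self._require("in_utt", command)
        elif command == "engine_end_utt":
            self._require("in_utt", command)
            self.state = "ended"
        elif command == "engine_cancel_utt":
            self._require("in_utt", command)
            self.state = "idle"
        elif command == "engine_proc_result":
            self._require("ended", command)
        elif command in ("engine_new_session", "engine_set_lm", "engine_restart"):
            pass  # Assumption: valid in any state, utterance state unchanged.
        else:
            raise ValueError(f"unknown command: {command}")

    def _require(self, state: str, command: str) -> None:
        if self.state != state:
            raise RuntimeError(f"{command} not valid in state {self.state!r}")


# A typical command sequence for one utterance.
checker = CommandOrderChecker()
for cmd in ["engine_new_session", "engine_begin_utt", "engine_proc_raw",
            "engine_proc_partial", "engine_end_utt", "engine_proc_result"]:
    checker.accept(cmd)
```

A real engine would probably log and recover rather than raise, but a checker like this can help when debugging a new engine against AudioServer.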