Plugging in a speech recognition engine
Olympus allows running several speech recognition engines in parallel (for example, using different acoustic models, language models, or even different engines).
- Audio input processing and the aggregation of hypotheses from all engines are performed by the AudioServer module.
- Each engine is a separate process that communicates with AudioServer through sockets. AudioServer captures audio from the audio device and streams it to the different engines.
- Each engine then processes it and must return a partial recognition hypothesis (i.e. what it has recognized so far in the current utterance).
- Once the utterance has been endpointed (endpointing decision is made by the Interaction Manager), each engine is informed and must return a final hypothesis for the utterance.
- Both partial and final hypotheses are usually rich Galaxy structures that contain not only the hypothesized word sequence for the utterance but also information such as confidence scores and time alignment.
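To make the shape of such a hypothesis concrete, here is a minimal sketch in Python. It is purely illustrative: the real hypotheses are Galaxy frames, and the class and field names below are assumptions, not Olympus code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WordHyp:
    word: str          # hypothesized word
    confidence: float  # engine-specific confidence score
    start: float       # start time, relative to the session start
    end: float         # end time, relative to the session start

@dataclass
class UtteranceHyp:
    utt_id: str
    words: List[WordHyp] = field(default_factory=list)

    @property
    def text(self) -> str:
        # the hypothesized word sequence for the utterance
        return " ".join(w.word for w in self.words)

hyp = UtteranceHyp("000", [WordHyp("FIVE", 0.92, 1.00, 1.40),
                           WordHyp("PM", 0.88, 1.45, 1.70)])
```

A partial hypothesis would carry the words recognized so far; the final hypothesis carries the complete utterance.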
Communication from the AudioServer to each engine follows a specific protocol in which commands are sent as plain text through the socket. Each command consists of a command name followed by one or more arguments, all separated by whitespace and terminated by a newline.
- The special command engine_proc_raw announces audio data. It takes one text argument indicating the number of bytes in the incoming buffer, immediately followed by the audio data itself (headerless PCM, usually 8 kHz, 16-bit, mono).
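The wire format above can be parsed in a few lines. The following Python sketch (the function name is my own, not Olympus code) reads one command at a time from a buffered binary stream and handles the engine_proc_raw special case:

```python
import io

def read_command(f):
    """Read one AudioServer command from a buffered binary stream.
    A command is a whitespace-separated text line terminated by a newline;
    engine_proc_raw is additionally followed by the announced number of
    raw audio bytes."""
    line = f.readline()
    if not line:
        return None                  # stream closed
    name, *args = line.decode("ascii").split()
    payload = b""
    if name == "engine_proc_raw":
        nbytes = int(args[0])        # size of the incoming audio buffer
        payload = f.read(nbytes)     # raw PCM: 16-bit mono, usually 8 kHz
    return name, args, payload

# demo on an in-memory stream: one begin-utterance command, then a 4-byte audio block
buf = io.BytesIO(b"engine_begin_utt 000\nengine_proc_raw 4\n\x01\x02\x03\x04")
first = read_command(buf)
second = read_command(buf)
```

In a real engine this would run in a loop over the socket's file object (e.g. `sock.makefile("rb")`), dispatching on the command name.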
engine_new_session <log_file_path> <start_timestamp>
Indicates that the engine should be ready to start a new dialog. Depending on the engine, this can mean resetting some state information (e.g. which LM is used for decoding). It also provides a new path for the log file for this engine (the engine should output log information to a separate log file for each session).
The first argument specifies the path of the log file, typically of the form '..\Logs\MySystem\20081125\000\MySystem-20081125-000' (where MySystem is the name of the system, 20081125 is the date, and 000 is the session number).
- The engine is expected to append a unique engine name and '.txt' to the path to generate its log file name (e.g. '..\Logs\MySystem\20081125\000\MySystem-20081125-000-sphinx_male.txt').
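As a sketch, this name derivation can be written as follows (the helper name is invented for illustration):

```python
def engine_log_path(base_path: str, engine_name: str) -> str:
    """Append the engine's unique name and '.txt' to the path
    received with engine_new_session."""
    return f"{base_path}-{engine_name}.txt"

log_file = engine_log_path(r"..\Logs\MySystem\20081125\000\MySystem-20081125-000",
                           "sphinx_male")
```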
The second argument specifies the initial absolute timestamp of the session as a 64-bit integer (on Windows, this is obtained through the high-resolution performance counter's QueryPerformanceCounter function).
- All time alignment information (e.g. word alignment within utterances) must be given relative to this initial timestamp. For example, if the hypothesized text of the utterance is FIVE PM, time alignment should be given as the time since the beginning of the session (given by the start_timestamp argument) until the beginning (resp. end) of the word FIVE and similarly for PM.
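One simple way to produce session-relative times is to count the audio samples consumed since engine_new_session. The sketch below assumes this approach; the class, the method names, and the choice of seconds as the unit are all assumptions (the document only requires that times be relative to start_timestamp).

```python
SAMPLE_RATE = 8000    # Hz; the usual AudioServer stream rate
BYTES_PER_SAMPLE = 2  # 16-bit samples

class SessionClock:
    """Tracks how much audio has been consumed since the session started,
    so that word boundaries can be reported relative to the session start."""
    def __init__(self, start_timestamp: int):
        self.start_timestamp = start_timestamp  # 64-bit value from AudioServer
        self.samples_consumed = 0

    def consume(self, nbytes: int) -> None:
        # called for each engine_proc_raw audio block
        self.samples_consumed += nbytes // BYTES_PER_SAMPLE

    def seconds_since_start(self) -> float:
        return self.samples_consumed / SAMPLE_RATE

clock = SessionClock(start_timestamp=123456789)
clock.consume(16000)  # one second of 16-bit, 8 kHz mono audio
```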
engine_begin_utt <utterance_id>
Indicates that a new utterance is starting.
The argument specifies a string identifying this utterance (usually something like 000, 001, ...). The engine should be ready to receive audio data after this command.
Indicates that the current utterance should be terminated and that the AudioServer does not expect any results from the engine.
engine_proc_raw <num_bytes>
Indicates an incoming block of raw audio data.
The argument specifies the number of bytes to read as audio data. Once that many bytes have been read, the engine should be ready to receive new commands from the socket.
engine_end_utt
Indicates that the current utterance should be terminated and that AudioServer will ask for a final hypothesis.
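Put together, the commands above define the byte stream an engine sees for one utterance. Here is a sketch of that sequence from the sending side (the helper is illustrative, not AudioServer code):

```python
def utterance_wire_bytes(utt_id: str, audio_chunks) -> bytes:
    """Format the command sequence for one utterance: begin the utterance,
    send one engine_proc_raw block per audio chunk, then end it."""
    out = bytearray()
    out += f"engine_begin_utt {utt_id}\n".encode("ascii")
    for chunk in audio_chunks:
        out += f"engine_proc_raw {len(chunk)}\n".encode("ascii")
        out += chunk                       # raw PCM bytes follow the command line
    out += b"engine_end_utt\n"
    return bytes(out)

wire = utterance_wire_bytes("000", [b"\x00\x01" * 4])
```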
New in Olympus 2.5
Changes the acoustic model used by the engine to the specified one.
The argument indicates the acoustic model to use from now on.
New in Olympus 2.5
Gets the acoustic model currently in use by the engine and returns it in a Galaxy clause frame called audioconfig, inside the :notify_acoustic_model property.
Changes the language model/grammar in use by the engine to the specified one. This allows for context-dependent language models. If the engine does not support dynamic LM switching, this command should be ignored.
The argument indicates the name of the LM/grammar to use from now on.
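The "ignore if unsupported" rule amounts to a trivial guard in the engine's handler. A sketch (the class and method names are invented for illustration):

```python
class Engine:
    """Minimal engine-side state for dynamic LM switching."""
    supports_lm_switching = True

    def __init__(self):
        self.current_lm = "default"

    def handle_set_lm(self, lm_name: str) -> None:
        if not self.supports_lm_switching:
            return                     # unsupported: silently ignore the command
        self.current_lm = lm_name      # decode subsequent utterances with this LM

eng = Engine()
eng.handle_set_lm("confirm_date")
```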
Requests the engine to provide a final hypothesis for the last utterance. This is always called after engine_end_utt.
Requests the engine to provide a partial hypothesis for the current utterance. This is always called after engine_begin_utt and before engine_end_utt.
Requests the engine to reinitialize itself, reloading all its models and configuration.
- For example, this can be used to modify the acoustic model, language model, and/or dictionary during a dialog and have the engine take the changes into account.