Raw audio input turn
Olympus will be able to handle utterances for which it is known a priori that the audio itself is of interest, but for which there is no hope of any semantic understanding. This comes into play in applications where, for example, you want to record a voice memo, state your name, or hum a tune and have a backend process try to figure out the name of the song. In all of these cases we can't get any semantic information, so we would like to bypass most of the input chain. Here's how it will work:
- The Dialog Manager initiates a raw audio turn and indicates this with a flag in the dialog_state frame: :raw_audio_turn 1
- The Interaction Manager picks up the dialog_state frame as usual, but enters a raw_audio_turn state.
- The AudioServer picks up the dialog_state frame as usual, but enters a raw_audio_turn state.
- Eventually VAD indicates voice, and publishes the event.
- The Interaction Manager picks up the VAD event as usual, and issues an increment utterance, as usual.
- The AudioServer hears the increment utterance and starts recording but, because it is in raw_audio_state, does not decode.
- The Interaction Manager ignores the fact that it is not getting any partial hypotheses while in raw_audio_state.
- VAD indicates non-speech.
- The Interaction Manager gets the VAD message, sends a utt_end message unconditionally, and clears its raw_audio_turn_state.
- The AudioServer gets the utt_end message, pushes an audio event, and clears its raw_audio_state. The audio event is a concept with an index to the raw audio file that the AudioServer produces.
- The Interaction Manager sends a handle event to the Dialog Manager for the new audio event, analogous to speech and gui events.
- The Dialog Manager accepts the audio event and binds that concept. It may store that concept in a backend system if it needs to.
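The recording steps above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Olympus code: the class and method names, the dictionary message shapes, and the file name utt0001.raw are all hypothetical, and the real servers exchange frames through the hub rather than calling each other directly.

```python
"""Toy simulation of a raw audio input turn (illustrative only, not Olympus code)."""


class AudioServer:
    def __init__(self):
        self.raw_audio_turn = False
        self.recording = False
        self.decoding = False
        self.audio_file = "utt0001.raw"  # hypothetical name of the recorded file

    def on_dialog_state(self, frame):
        # Enter the raw audio turn state when the Dialog Manager sets the flag.
        self.raw_audio_turn = frame.get(":raw_audio_turn") == 1

    def on_increment_utterance(self):
        # Always record; decode only in a normal (non-raw) turn.
        self.recording = True
        self.decoding = not self.raw_audio_turn

    def on_utt_end(self):
        # Stop recording. In a raw audio turn, push an audio event that
        # points at the recorded file, then clear the raw audio state.
        self.recording = False
        self.decoding = False
        if not self.raw_audio_turn:
            return None
        self.raw_audio_turn = False
        return {"type": "audio", "raw_audio_file": self.audio_file}


class InteractionManager:
    def __init__(self, audio_server):
        self.audio_server = audio_server
        self.raw_audio_turn = False
        self.sent = []  # names of the messages issued, in order

    def on_dialog_state(self, frame):
        self.raw_audio_turn = frame.get(":raw_audio_turn") == 1

    def on_vad(self, speech):
        if speech:
            self.sent.append("increment_utterance")
            self.audio_server.on_increment_utterance()
            return None
        # Non-speech: send utt_end without waiting for any hypothesis,
        # then clear the raw audio turn state.
        self.sent.append("utt_end")
        self.raw_audio_turn = False
        event = self.audio_server.on_utt_end()
        if event is not None:
            # Forward the audio event to the Dialog Manager.
            self.sent.append("handle_event")
        return event


def run_raw_audio_turn():
    audio_server = AudioServer()
    im = InteractionManager(audio_server)
    frame = {":raw_audio_turn": 1}  # flag set by the Dialog Manager
    im.on_dialog_state(frame)
    audio_server.on_dialog_state(frame)
    im.on_vad(speech=True)
    decoding_during_turn = audio_server.decoding
    event = im.on_vad(speech=False)
    return im.sent, decoding_during_turn, event
```

Calling run_raw_audio_turn() walks through one complete turn and returns the messages the Interaction Manager issued, whether the AudioServer was decoding while recording, and the audio event that would be bound by the Dialog Manager.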
Another process handles playback:
- The Dialog Manager decides it needs to play an audio segment. It may retrieve the concept (which is essentially the raw audio filename) from the backend if it doesn't already know it. It issues a (high level) play audio action with the raw audio pointer.
- The Interaction Manager picks up that new kind of action, and issues dialog_state messages that inform other servers to hold the floor (as usual).
- The Interaction Manager issues a (low level) play raw audio action with the raw audio pointer.
- The AudioServer gets the play raw audio message and commences picking up the file and playing it.
- The AudioServer needs to heed :interrupt_system_prompt messages now, just like Kalliope.
- Once completed, the AudioServer publishes a conveyance message.
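The playback steps can be sketched in the same toy style. Again, this is a sketch under assumptions rather than the actual implementation: the function names, the message dictionaries, and the ":floor" field are hypothetical, and playback is modelled as iterating over chunks while polling for an :interrupt_system_prompt request. Publishing a conveyance message with an "interrupted" status on interruption is also an assumption; the notes above only specify the message on completion.

```python
"""Toy sketch of raw audio playback (illustrative only, not Olympus code)."""


def interaction_manager_play(action):
    # Expand a high-level play audio action into the messages the
    # Interaction Manager issues: first hold the floor, then ask the
    # AudioServer to play the file.
    pointer = action["raw_audio_file"]
    return [
        {"type": "dialog_state", ":floor": "system"},  # hypothetical field
        {"type": "play_raw_audio", "raw_audio_file": pointer},
    ]


def audio_server_play(chunks, interrupt_requested):
    # Play the audio chunk by chunk. interrupt_requested is a zero-argument
    # callable polled before each chunk; it stands in for receiving an
    # :interrupt_system_prompt message.
    played = 0
    for _chunk in chunks:
        if interrupt_requested():
            return {"type": "conveyance", "status": "interrupted",
                    "chunks_played": played}
        played += 1  # stand-in for writing the chunk to the audio device
    return {"type": "conveyance", "status": "completed",
            "chunks_played": played}
```

Polling between chunks is one simple way to keep playback interruptible; a real server would more likely react to the interrupt message asynchronously.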