This is a high-level description of the Olympus architecture and the principles behind it. For the most part, it is based on Bohus et al. (2007) and Raux & Eskenazi (2007). Please refer to these and other publications for more information.
Three layers of abstraction
Conceptually, Olympus distinguishes three layers, as shown in Figure 1. At each layer, we define events, i.e. observations about the real world, and actions, i.e. requests to act upon the real world. The lowest layer represents the real world itself (e.g. the user speaking to interrupt the system). The intermediate layer is a first level of abstraction, consisting of real-time events and actions with continuous properties (e.g. the exact timing and duration of a user utterance, as perceived by the Voice Activity Detector and speech recognizer). Finally, the top layer is the domain of purely symbolic events and actions with typically discrete properties (e.g. a representation of the fact that the user barged in after hearing a specific phrase uttered by the system).
The core components of the architecture perform two types of tasks: 1) they accept events and actions at one level and produce events and actions at the next level (event composition/action decomposition), and 2) they produce actions at a given level in response to events at the same level (control). The interface between the real world and the intermediate layer is provided by a set of sensors and actuators; no control happens at this level. The interface between the intermediate and top layers is handled by a new module called the Interaction Manager (IM). In addition to event composition and action decomposition, the IM controls reactive behavior that does not involve high-level cognition (e.g. stopping speaking when the user interrupts). Finally, within the top layer, the Dialogue Manager (DM) plans high-level actions based on high-level events. Being at the top of the architecture, the DM does not perform any composition/decomposition.
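The composition step can be illustrated with a minimal sketch. The event type names below (`AudioFrame`, `UserUtterance`, `BargeIn`) and the `compose` function are assumptions made for illustration, not part of the actual Olympus code; the point is how a timed intermediate-level event is abstracted into a discrete, symbolic top-level event.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative event types for the three layers of abstraction.
# These names are hypothetical, not taken from the Olympus code base.

@dataclass
class AudioFrame:
    """Lowest layer: a raw real-world signal."""
    timestamp: float
    samples: bytes

@dataclass
class UserUtterance:
    """Intermediate layer: a real-time event with continuous
    properties (exact start/end times, recognition hypothesis)."""
    start: float
    end: float
    hypothesis: str

@dataclass
class BargeIn:
    """Top layer: a purely symbolic event with discrete properties."""
    interrupted_prompt: str
    user_intent: str

def compose(utt: UserUtterance, prompt: str, prompt_end: float) -> Optional[BargeIn]:
    """Event composition (intermediate -> top): the user barged in
    if they started speaking before the system prompt finished."""
    if utt.start < prompt_end:
        return BargeIn(interrupted_prompt=prompt, user_intent=utt.hypothesis)
    return None
```

Action decomposition runs in the opposite direction: a single symbolic action (e.g. "confirm the date") would be broken down into timed, concrete requests for the actuators.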
Sensors and actuators
At the lowest level of the architecture, between the real world and the Intermediate layer, sensors are specialized components that capture events happening in the real world and translate them into Intermediate-level events. For example, the speech understanding module (which in our systems comprises speech recognition, parsing and confidence annotation) is a sensor that converts real-world acoustic events into semantics (and associated timing information). In a multimodal system, the GUI or a gaze tracker would be other sensors. For the most part, each sensor operates independently of the rest of the system (in particular of other sensors); multi-sensor/multimodal integration is not performed at this level but by the Interaction Manager, which lies between the Intermediate and Top layers. Some sensors, however, might use information from higher levels, such as when the speech recognizer uses a context-dependent language model.
Actuators also lie at the lowest level. These modules are in charge of acting upon the real world based on actions (more specifically, action requests) sent by the Interaction Manager. The speech synthesizer is a typical example of an actuator. Again, in a multimodal system, the GUI might act as an actuator (if the system can manipulate the display).
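The division of labor between sensors and actuators can be sketched as a pair of interfaces. This is a hypothetical sketch, not the real Olympus component API: the class names and the dictionary-based event format are assumptions, and the "speech understanding" sensor below fakes the recognition pipeline rather than running an actual decoder.

```python
from abc import ABC, abstractmethod

class Sensor(ABC):
    """Translates real-world input into Intermediate-level events.
    Each sensor runs independently of the others; integration is
    left to the Interaction Manager."""
    @abstractmethod
    def sense(self, raw: dict) -> dict: ...

class Actuator(ABC):
    """Acts upon the real world in response to action requests
    sent down by the Interaction Manager."""
    @abstractmethod
    def act(self, request: dict) -> None: ...

class SpeechUnderstanding(Sensor):
    """Toy stand-in for recognition + parsing + confidence annotation."""
    def sense(self, raw: dict) -> dict:
        # `raw` stands in for an acoustic event; a real sensor would
        # run an ASR decoder and a semantic parser here.
        return {
            "type": "utterance",
            "semantics": raw["decoded"].split(),
            "confidence": raw.get("score", 0.5),
            "start": raw["start"],
            "end": raw["end"],
        }

class SpeechSynthesizer(Actuator):
    """Toy actuator: 'speaks' by recording the prompt it was given."""
    def __init__(self) -> None:
        self.spoken: list[str] = []

    def act(self, request: dict) -> None:
        self.spoken.append(request["prompt"])
```

Note that both classes produce or consume Intermediate-level events with continuous properties (timing, confidence); neither makes control decisions.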
The Interaction Manager
The Interaction Manager (IM) acts both as the interface between the Intermediate and Top layers and as the controller of the system's reactive behavior. In particular, it sends appropriate dialogue state and floor update events to the DM, decides when the system should start and stop speaking, and handles turn-taking errors (e.g. when the system and user start speaking at the same time).
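The IM's reactive role can be sketched as a small floor-tracking state machine. This is a simplified illustration under assumed names (`Floor`, `InteractionManager`, the string-valued events and actions), not the actual Olympus implementation; it only shows how barge-in can be handled reactively, below the level of the DM.

```python
from enum import Enum, auto

class Floor(Enum):
    """Who currently holds the conversational floor."""
    FREE = auto()
    USER = auto()
    SYSTEM = auto()

class InteractionManager:
    """Hypothetical sketch of reactive floor control in the IM."""
    def __init__(self) -> None:
        self.floor = Floor.FREE
        self.actions = []   # action requests sent down to actuators
        self.events = []    # floor update events sent up to the DM

    def user_started_speaking(self) -> None:
        if self.floor is Floor.SYSTEM:
            # Barge-in: stop speaking reactively, without consulting
            # the DM (no high-level cognition involved).
            self.actions.append("stop_speaking")
        self.floor = Floor.USER
        self.events.append("floor:user")   # keep the DM's state updated

    def user_stopped_speaking(self) -> None:
        self.floor = Floor.FREE
        self.events.append("floor:free")

    def system_wants_to_speak(self, prompt: str) -> bool:
        if self.floor is Floor.USER:
            return False   # avoid talking over the user
        self.floor = Floor.SYSTEM
        self.actions.append(f"speak:{prompt}")
        return True
```

The DM still decides *what* to say; the IM decides *when* it is actually said, and repairs timing conflicts on its own.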
The Dialogue Manager
Olympus uses the RavenClaw dialog management framework (Bohus & Rudnicky, 2003). In a RavenClaw-based dialog manager, the domain-specific dialog task is represented as a tree whose internal nodes capture the hierarchical structure of the dialog and whose leaves encapsulate atomic dialog actions (e.g., asking a question, providing an answer, accessing a database). A domain-independent dialog engine executes this dialog task, interprets the input in the current dialog context and decides which action to engage next. In the process, the dialog manager may exchange information with other domain-specific agents (e.g., an application back-end, a database access agent, or a temporal reference resolution agent).
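The tree representation can be sketched as follows. This is only an illustration of the structure: the `Agent` class, the agent names and the fixed depth-first execution are assumptions made for the sketch, whereas a real RavenClaw engine traverses the tree with an execution stack and an expectation agenda, choosing its path based on the dialog context.

```python
class Agent:
    """Hypothetical node in a RavenClaw-style dialog task tree.
    Internal nodes structure the dialog; leaves carry an atomic
    dialog action (ask, inform, database access, ...)."""
    def __init__(self, name, children=None, action=None):
        self.name = name
        self.children = children or []
        self.action = action

    def execute(self, log):
        # Simplified depth-first run; the real engine is input-driven.
        if self.action:
            log.append(f"{self.name}: {self.action}")
        for child in self.children:
            child.execute(log)
        return log

# Illustrative task tree for an information-giving domain.
task = Agent("Root", children=[
    Agent("Welcome", action="inform(welcome)"),
    Agent("GetQuery", children=[
        Agent("AskDate", action="request(date)"),
        Agent("AskTime", action="request(time)"),
    ]),
    Agent("GiveResults", action="inform(results)"),
])
```

Because the engine is domain-independent, porting the dialog manager to a new domain amounts to writing a new tree of this kind (plus any back-end agents it talks to).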
Bohus, Dan & Alexander I. Rudnicky (2003), "RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda", Eurospeech 2003.
Bohus, Dan; Antoine Raux; Thomas K. Harris; Maxine Eskenazi & Alexander I. Rudnicky (2007), "Olympus: an open-source framework for conversational spoken language interface research", Bridging the Gap: Academic and Industrial Research in Dialog Technology workshop at HLT/NAACL 2007.
Raux, Antoine & Maxine Eskenazi (2007), "A Multi-Layer Architecture for Semi-Synchronous Event-Driven Dialogue Management", IEEE Automatic Speech Recognition and Understanding Workshop.