LogParsers


The LogParsers tools are a set of Perl modules and scripts that allow Olympus developers to traverse Olympus log directory structures and collect all kinds of information from the logs. Their main use is to create data files that can be fed into statistical analysis programs. Such data files can also be used as training data for various models that control the behavior of Olympus-based systems (e.g. the confidence annotation model in Helios).

Overview

One of the main roles of LogParsing scripts is to synchronize logs from different modules such as the DM, the IM, Helios, etc., so that information from these sources can be merged at the turn level. This synchronization step is necessary because turn numbers across different modules are not guaranteed to match. For example, when a user utterance yields an empty recognition hypothesis, it appears as a turn in the IM but is not forwarded to the DM (since such utterances are most likely background noise that was erroneously endpointed by the VAD). This kind of phenomenon creates mismatches between turn IDs in different modules. The LogParsers use different heuristics to resolve these discrepancies, producing a unified log hash that contains information collected from different modules for each turn.

Contents of the LogParsers directory

The core of the LogParsers tools is a set of Perl modules that are located directly in the LogParsers directory. Each extracts a specific set of data from module logs and adds them to the global log. Programs that use these modules to generate particular views of the session log are in sub-folders (described below). You can either use these programs as is or use them as models for custom processing.

AllLogsSynchronizedParser.pm - parses and synchronizes all logs from a session
TraverseLogs.pm - helps traverse log directory structures
ApolloLogParser.pm - processes Interaction Manager logs
HeliosLogParser.pm - processes Helios (confidence assessment) logs
RavenClawLogParser.pm - processes Dialog Manager logs
LabelsParser.pm - loads label files into the global log (if available)
TranscriptReader.pm - reads in manual transcripts (if available)
ParsedTranscriptReader.pm - reads in Phoenix parse trees (if available)
F0Reader.pm - reads in utterance F0 data (fundamental frequency, intonation marks) (if available)

The following folders contain Perl scripts (.pl) and utilities that use the above modules to extract data for different purposes.

Labeling/

These Perl scripts add automatically generated labels directly into the log directories (usually to the ????-labels.txt file). After acquiring the command line arguments, the scripts build the list of dialogue folders.

Analysis/

Use the LogParsers modules to extract numerical data from the logs that can be further analyzed with statistical tools such as Matlab.

Training/

Extract numerical data from the logs. The main purpose of these data is to train models that can be incorporated into the dialogue system to improve its performance.

Utils/

Perform various subsidiary tasks such as generating control files for the collection scripts or merging labels from different copies of the logs.

How to use the packages

The collect_log_info.pl script in Analysis provides a simple example of how to use the LogParsers in your own scripts. After reading in the command line arguments, the script gets the list of dialogue folders, either from a control file or by traversing a directory structure:

 # if we don't have a control file, get the sessions from the rootFolder
 if($controlFile eq "") {
   GetSessionFolders($rootFolder, $startDate, $endDate, \@sessions);
 } else {
   # o/w read them from the control file: one session folder per line,
   # optionally followed by a tab and auxiliary data
   open(my $control, "<", $controlFile) ||
     die("Could not open control file ($controlFile) for reading.\n");
   while(<$control>) {
     chomp;
     # skip lines that do not match, so stale $1/$3 are never reused
     next unless /^([^\t]+?)(\t(.+)?)?$/;
     push @sessions, $1;
     push @aux_session_data, $3;
   }
   close($control);
 }
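
The control-file branch above implies a simple format: one session folder per line, optionally followed by a tab and auxiliary data. The following is a minimal sketch of that parsing as a standalone function, assuming this format holds. The helper name read_control_lines and the sample folder names are made up for illustration and are not part of the LogParsers.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the LogParsers): parse control-file
# lines from an open filehandle. Each line is assumed to hold a session
# folder, optionally followed by a tab and auxiliary data.
sub read_control_lines {
    my ($fh) = @_;
    my (@sessions, @aux);
    while (my $line = <$fh>) {
        chomp $line;
        # Skip lines that are empty or that start with a tab.
        next unless $line =~ /^([^\t]+)(?:\t(.*))?$/;
        push @sessions, $1;
        push @aux, $2;    # undef when no auxiliary data is present
    }
    return (\@sessions, \@aux);
}

# Usage with an in-memory filehandle standing in for a real control file.
# The folder names below are invented sample data.
my $text = "logs/20080101/000\tcondition_A\nlogs/20080101/001\n";
open(my $fh, "<", \$text) or die "Could not open in-memory file: $!";
my ($sessions, $aux) = read_control_lines($fh);
close($fh);
print scalar(@$sessions), " sessions read\n";
```

Keeping the auxiliary data in a parallel array mirrors the @sessions / @aux_session_data pair used by the script, so entry i of one array always describes entry i of the other.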

Next we configure the parsers, indicating the version of the Olympus logs (olympus1 or olympus2, which have slightly different logs), the system name (to allow some system-specific processing, so far only for the RoomLine, LetsGo, and LetsGoPublic systems), and the transcript cleaning rules file (which indicates how to preprocess the transcripts before parsing them or comparing them to ASR output):

 ConfigureAllLogsSynchronizedParser($olympusVersion, $systemName, $transcriptCleaningRulesFile);

Then we loop through the list of session folders and run:

 ParseAllLogs($session, \%info, $batchMode, $needTranscripts, 0, 0);

on each of them. The arguments indicate:

  • the dialogue folder
  • a reference to the hash that will receive all the log information
  • a flag indicating whether to process the ASR results that were obtained at runtime or in a subsequent batch processing (the Let's Go corpora only have runtime results so this flag should be 0 by default)
  • three flags indicating whether information from transcripts, parsed transcripts, and labels (which are not always available) should be included in the hash

We can then access all the elements of the %info hash for this dialogue (collect_log_info.pl simply outputs them all to a file). See below, or the comments at the beginning of each .pm file, for the list of available hash keys.
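
Put together, the steps above form a short driver loop. The sketch below uses local stubs in place of the real ConfigureAllLogsSynchronizedParser and ParseAllLogs so that it runs on its own. The stub bodies, the parameter names $needParsedTranscripts and $needLabels, the file name rules.txt, and the session names are all invented. Checking the optional error key to detect a failed parse is an assumption based on the key list, not a documented contract.

```perl
use strict;
use warnings;

# Stand-in stubs so the sketch runs on its own. In a real script these
# come from AllLogsSynchronizedParser.pm; the bodies here are invented.
sub ConfigureAllLogsSynchronizedParser { return; }
sub ParseAllLogs {
    my ($session, $info, $batchMode, $needTranscripts,
        $needParsedTranscripts, $needLabels) = @_;
    if ($session =~ /broken/) {
        $info->{error} = "could not parse $session";   # made-up failure
    } else {
        $info->{last_turn_synchronized} = 3;           # made-up value
    }
}

ConfigureAllLogsSynchronizedParser("olympus2", "LetsGoPublic", "rules.txt");

my @sessions = ("session_ok", "session_broken");       # made-up folders
my (%results, @failed);
foreach my $session (@sessions) {
    my %info;                       # fresh hash for every dialogue
    ParseAllLogs($session, \%info, 0, 0, 0, 0);
    if (exists $info{error}) {      # ASSUMPTION: "error" key marks failure
        push @failed, $session;
        next;
    }
    $results{$session} = $info{last_turn_synchronized};
}
print "parsed: ", join(", ", sort keys %results), "\n";
print "failed: ", join(", ", @failed), "\n";
```

Declaring %info inside the loop avoids carrying keys over from one dialogue to the next.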

Example applications

To be completed.

Data Extracted by LogParser

The following is a list of keys placed into the global log hash by the different parser modules. In general, analysis programs select and compute over a subset of these keys. The existing programs will give you models for writing your own.

  • # indicates a turn index
  • $i, $j, $p, etc. indicate sub-item indices
  • * indicates an optional item

General

last_turn_synchronized - the number of turns that were successfully synchronized (synchronized turn indices run from 0 to last_turn_synchronized)
synch_turns.#.apollo - index of synchronized turn # in the apollo sub-hash
synch_turns.#.ravenclaw - index of synchronized turn # in the ravenclaw sub-hash
synch_turns.#.helios - index of synchronized turn # in the helios sub-hash
synch_turns.#.transcripts - index of synchronized turn # in the transcripts sub-hash
synch_turns.#.parsed_transcripts - index of synchronized turn # in the parsed_transcripts sub-hash
synch_turns.#.labels - index of synchronized turn # in the labels sub-hash
* error - description of the error encountered (if ParseAllLogs failed)
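
Merging per-module data at the turn level amounts to following the synch_turns indices into each module's turns. The sketch below builds one row per synchronized turn from made-up data. It assumes the keys are stored flat and spelled exactly as listed, and that synchronized turns are indexed from 0 to last_turn_synchronized inclusive. Neither assumption is confirmed here, so the layout is isolated in a single accessor, get (a hypothetical name), that can be adapted.

```perl
use strict;
use warnings;

# ASSUMPTION: keys are stored flat, spelled as in the key list. If the
# real %info is nested, only this accessor needs to change.
sub get { my ($info, $key) = @_; return $info->{$key}; }

# Made-up data for two synchronized turns. The Apollo and RavenClaw
# indices differ because Apollo turn 1 never reached the DM.
my %info = (
    "last_turn_synchronized"           => 1,
    "synch_turns.0.apollo"             => 0,
    "synch_turns.0.ravenclaw"          => 0,
    "synch_turns.1.apollo"             => 2,
    "synch_turns.1.ravenclaw"          => 1,
    "apollo.turns.0.user_utt.duration" => 900,
    "apollo.turns.2.user_utt.duration" => 1500,
    "ravenclaw.turns.0.dialog_state"   => "state_a",
    "ravenclaw.turns.1.dialog_state"   => "state_b",
);

my @rows;
# ASSUMPTION: synchronized turns are indexed 0 .. last_turn_synchronized.
for my $t (0 .. get(\%info, "last_turn_synchronized")) {
    my $im = get(\%info, "synch_turns.$t.apollo");      # index into apollo turns
    my $dm = get(\%info, "synch_turns.$t.ravenclaw");   # index into ravenclaw turns
    push @rows, [
        $t,
        get(\%info, "apollo.turns.$im.user_utt.duration"),
        get(\%info, "ravenclaw.turns.$dm.dialog_state"),
    ];
}
print join("\t", @$_), "\n" for @rows;
```

Each row can then be written as one line of a tab-separated data file for a statistics package.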

Apollo

apollo.filename - the name of the file being processed
apollo.start_time - the time the session started
apollo.end_time - the time the session ended
apollo.session_duration_msec - the session duration in milliseconds
apollo.turns.size - the total number of turns in the session
apollo.external_inputs.size - the total number of inputs
apollo.endpoint_threshold - the pause duration threshold for utterance endpoint detection (in ms)
apollo.appears_incomplete - 0/1 indicates whether the apollo log file is completed properly or not
* apollo.error - contains a description of the error encountered (if parse failed)
* apollo.fatal_error - indicates that the IM was terminated with a fatal error
apollo.errors.size - the number of Apollo errors
* apollo.errors.# - the list of various Apollo errors encountered
apollo.warnings.size - the number of Apollo warnings
* apollo.warnings.# - contains a list of Apollo warnings encountered

Timing data

apollo.turns.#.user_utt.start_time - time the user utterance for this turn started
apollo.turns.#.user_utt.end_time - time the user utterance ended
apollo.turns.#.user_utt.duration - duration of the user utterance
apollo.turns.#.user_utt.gap_since_system_stop - time between end of last system prompt and start of user utt #
apollo.turns.#.user_utt.gap_since_system_start - time between start of last system prompt and start of user utt #
apollo.turns.#.user_utt.gap_since_user_stop - time between end of last user utt and the start of user utt #
apollo.turns.#.user_utt.gap_til_system_start - time between end of user utt # and start of next system prompt
apollo.turns.#.user_utt.gap_til_system_stop - time between the end user utt # and end of next system prompt
apollo.turns.#.user_utt.gap_til_user_start - time between the end of user utt # and the start of next user utt (#+1)
apollo.turns.#.user_utt.pauses.size - number of non-speech pauses detected by the VAD during this utt
apollo.turns.#.user_utt.pauses.#.start_time - time the pause started
apollo.turns.#.user_utt.pauses.#.end_time - time the pause ended
apollo.turns.#.user_utt.pauses.#.duration - duration of the pause

Decoder data

apollo.turns.#.user_utt.asr_prepad -
apollo.turns.#.user_utt.asr_skipped_frames -
apollo.turns.#.user_utt.endpointing_threshold -
apollo.turns.#.user_utt.last_valid_uttid -
apollo.turns.#.user_utt.partial_hyps.$i.$k -
apollo.turns.#.user_utt.final_hyp_time -

Prompt data

apollo.turns.#.prompts.size -
apollo.turns.#.prompts.$p.id -
apollo.turns.#.prompts.$p.info -
apollo.turns.#.prompts.$p.tagged_prompt -
apollo.turns.#.prompts.$p.start_time -
apollo.turns.#.prompts.$p.end_time -
apollo.turns.#.prompts.$p.first_user_speech_start -
apollo.turns.#.prompts.$p.user_speech_start -
apollo.turns.#.prompts.$p.interrupted -
apollo.turns.#.prompts.$p.interruption_time -

Helios

helios.filename - the name of the file being processed
* helios.error - contains a description of the error encountered (if parse failed)
* helios.warnings - contains a list of warnings generated during parse
helios.turns.size - the total number of turns in the session
helios.turns.#.selected_hyp - the hypothesis that got selected
helios.turns.#.hyps.size - the number of hypotheses in the current turn
helios.turns.#.hyps.#.<feature> - the value of a particular feature

Ravenclaw

ravenclaw.filename - name of the file being processed
ravenclaw.start_time - time the session started
ravenclaw.end_time - time the session ended
ravenclaw.session_duration_msec - session duration in milliseconds
ravenclaw.external_inputs.size - total number of inputs (excluding the internally generated events)
ravenclaw.grounding_info - grounding information present? (i.e. if the grounding process was running)
ravenclaw.experimental_condition - experimental condition
* ravenclaw.normal_finish - the session terminated normally (with the RavenClaw core closing properly)
ravenclaw.appears_incomplete - 0/1 indicates whether the ravenclaw log file is completed properly or not
ravenclaw.errors.size - number of RavenClaw errors
* ravenclaw.error - description of the error encountered (if parse failed)
* ravenclaw.fatal_error - indicates that the DM was terminated with a fatal error
* ravenclaw.errors.# - list of various RavenClaw errors encountered
ravenclaw.warnings.size - number of RavenClaw warnings
* ravenclaw.warnings.# - contains a list of RavenClaw warnings encountered
ravenclaw.turns.size - total number of turns in the session
ravenclaw.turns.#.stack - available/unavailable
ravenclaw.turns.#.stack.size - size of the execution stack in a given turn
ravenclaw.turns.#.stack.# - the various agents on the execution stack in a given turn
ravenclaw.turns.#.dialog_state - dialog state at this turn
ravenclaw.turns.#.agenda.size - number of levels in the agenda in a given turn
ravenclaw.turns.#.agenda.level.#.size - number of expectations in a given level of the agenda on a given turn
ravenclaw.turns.#.agenda.level.#.generator - the agent that generated that level in the agenda
ravenclaw.turns.#.agenda.level.#.#.type - O or X according to whether that expectation is open or not
ravenclaw.turns.#.agenda.level.#.#.slot - grammar slot expected
ravenclaw.turns.#.agenda.level.#.#.concept - concept to bind to; this concept is in the form (requesting agent name)concept
* ravenclaw.turns.#.agenda.level.#.#.why_blocked - reason why that expectation was blocked
ravenclaw.turns.#.agenda.focus_level - level on which the focus is (most of the time this is level 0, but there are cases, for instance when an IC is on the stack, when the level can be > 0)
ravenclaw.turns.#.start_time - start-time of that particular turn (offset in milliseconds since the beginning of the dialog)
ravenclaw.turns.#.external_policy_polling_time - time spent outside the DM polling the external policy (this is by and large particular to the RoomLine experiments)
ravenclaw.turns.#.input.<key> - value (stores input attributes in key-value format)
ravenclaw.turns.#.input.event - if 0 there was no internal event; o/w stores the internal event that generated this input (in lowercase)
ravenclaw.turns.#.output_prompts - comma-separated list of prompt ids
ravenclaw.turns.#.output_prompts_dump - comma-separated list of prompts for that turn
ravenclaw.turns.#.output_prompts_dump_expanded - comma-separated list of expanded prompts for that turn
ravenclaw.turns.#.bargein - indicates if the user barged-in on that turn
ravenclaw.turns.#.conveyance_info - indicates the collated conveyance information for that prompt
ravenclaw.turns.#.timeout_period - the timeout period setting for that turn
ravenclaw.turns.#.nonu_threshold - the nonunderstanding threshold for that turn
ravenclaw.turns.#.concepts_dump.available - indicates if the dump of concepts is available in a certain turn
ravenclaw.turns.#.concepts_dump - the list of concepts that were dumped in this turn (separated by \n)
ravenclaw.turns.#.concepts_dump.<concept_name> - <concept_value>
ravenclaw.turns.#.concept_updates - the list of concepts that were updated in this turn (separated by \n); this list might contain repetitions
ravenclaw.turns.#.concept_updates.<concept_name> - the number of updates on that concept
ravenclaw.turns.#.concept_updates.<concept_name>.#.type - the update type
ravenclaw.turns.#.concept_updates.<concept_name>.#.initial - the initial value
ravenclaw.turns.#.concept_updates.<concept_name>.#.updated - the updated value
ravenclaw.turns.#.bindings.size - number of bindings that happened
ravenclaw.turns.#.bindings.#.slot - the slot that bound
ravenclaw.turns.#.bindings.#.value - the value that bound
ravenclaw.turns.#.bindings.#.concept - the concept to which the slot bound; this concept name is in the form (requesting agent name)concept
ravenclaw.turns.#.bindings.concepts_bound - number of concepts bound
ravenclaw.turns.#.bindings.concepts_blocked - number of concepts blocked
ravenclaw.turns.#.bindings.forced_updates - number of forced updates (N/A in earlier versions of RavenClaw which did not have forced updates)
ravenclaw.turns.#.grounding.nonu - indicates whether there was a nonunderstanding in that turn
ravenclaw.turns.#.grounding.nonu.type - type of non-understanding (0 - NO_PARSE, 1 - NO_MATCH, 2 - BLOCKED_MATCH, 3 - LOW CONF)
ravenclaw.turns.#.grounding.nonu.action - action taken, or "unavailable" if it cannot be determined
ravenclaw.turns.#.grounding.nonu.seg.start - 0/1 indicates if this nonu is the start of a nonunderstanding segment
ravenclaw.turns.#.grounding.nonu.seg.length - the remaining length of the nonu segment
ravenclaw.turns.#.grounding.nonu.seg.recovered - indicates whether this segment was successfully recovered
ravenclaw.turns.#.grounding.concepts.actions - a comma-delimited list of actions taken on different concepts. For example EXPL_CONF(user_name), REQUEST(date)
ravenclaw.turns.#.actions - a comma-delimited list of actions that happened in that turn. Unlike the grounding.concepts.actions field above, this list includes all the actions that happened, even if they were not explicitly triggered in this turn by the grounding manager. For now, this includes explicit confirms that get repeated as a result of a non-understanding, EXPL_CONF_STACK, and task implicit confirms, IMPL_CONF_TASK (note that task implicit confirms currently include all the concepts that are out there, not only the user concepts)
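
The grounding.concepts.actions value is a string such as EXPL_CONF(user_name), REQUEST(date), so it usually has to be split before analysis. The sketch below does this, assuming every entry has the ACTION(concept) form shown in that one example. Entries without parentheses would be silently skipped. The helper name parse_actions is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list such as
# "EXPL_CONF(user_name), REQUEST(date)" into [action, concept] pairs.
# ASSUMPTION: every entry looks like ACTION(concept).
sub parse_actions {
    my ($list) = @_;
    my @pairs;
    # Walk the string, capturing each action name and its concept.
    while ($list =~ /([A-Za-z_]+)\(([^)]*)\)/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my @pairs = parse_actions("EXPL_CONF(user_name), REQUEST(date)");
print "$_->[0] on $_->[1]\n" for @pairs;
```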

Pitch

f0.f0.num_turns - number of turns found in the session
f0.f0.num_f0 - number of turns for which f0 info was found in the session
* f0.error - indicates an error if it occurred
f0.f0.#.available - 0/1 indicates if the f0 info is available for this turn
f0.f0.#.pitch_mean
f0.f0.#.pitch_min
f0.f0.#.pitch_max
f0.f0.#.pitch_std
f0.f0.#.pitch_min_slope
f0.f0.#.pitch_max_slope
f0.f0.#.pitch_max_incr
f0.f0.#.pitch_min_incr
f0.f0.#.perc_unvoiced
f0.f0.#.prepau
f0.f0.#.num_voiced_segments
f0.f0.#.rate_voiced_segments

Parsed Transcripts

parsed_transcripts_avail - indicates if the parsed transcripts info is available
parsed_transcripts.num_turns - the number of turns found in the session
parsed_transcripts.num_parsed_transcripts - the number of parsed transcripts found in the session
* parsed_transcripts.error - indicates an error if it occurred
parsed_transcripts.parsed_transcripts.#.available - 0/1 indicates if the parsed transcript is available
parsed_transcripts.parsed_transcripts.# - the actual transcript

Transcripts

transcripts_avail - indicates if the transcripts info is available
transcripts.num_turns - the number of turns found in the session
transcripts.num_transcripts - the number of transcripts found in the session
* transcripts.error - indicates an error if it occurred
transcripts.transcripts.#.available - 0/1 indicates if the transcript is available
transcripts.transcripts.# - the actual transcript
transcripts.transcripts.#.cleaned - the cleaned transcript
transcripts.transcripts.#.ctagged.available - indicates if the ctagged transcript is available
transcripts.transcripts.#.ctagged - the ctagged transcript for that turn
transcripts.transcripts.#.has_feed - indicates that the turn was tagged for feedback (system hearing itself)
transcripts.transcripts.#.has_background - indicates that the turn was tagged for background voices
transcripts.transcripts.#.has_noise - indicates that the turn was tagged for other noises
transcripts.transcripts.#.has_dtmf - indicates that the turn was tagged for DTMF tones
transcripts.transcripts.#.has_mumble - indicates that the turn was tagged for mumble (unintelligible speech)
transcripts.transcripts.#.has_cutin - indicates that the turn was tagged for cut-in (utterance is endpointed before the user has finished speaking)
transcripts.transcripts.#.has_comment - indicates that the turn was tagged as a comment (non-task speech directed to the developers)
transcripts.transcripts.#.has_remark - indicates that the turn was tagged as a remark (aside or other non-task speech)

Labels

labels_avail - indicates if the labels info is available
labels.num_turns - the number of turns in the labels file
* labels.error - the parsing error (if it occurs)
labels.global.<feature> - <value>
labels.turn.#.<feature> - <value>