The CorpusTools set of Perl scripts lets you manipulate corpora collected from Olympus systems. Its primary purpose is to generate training and testing data from these corpora.
- Put all the raw data you want to use in a directory (organized in subdirectories like [Date]\[Session_Number], including the human transcriptions). If you've been using a Communicator-like dialogue system (and Olympus is such a system), this should already be the structure of your logs.
- Generate transcriptions. VBScribe may be used for this.
- Call the Perl script "collect_corpus.pl" on this directory. The script traverses the directory structure and collects the transcription and state information. State is derived from the identifiers of the system utterances (these ids are sent to the NLG by the DM and stored in the log files). The mapping between prompt id and state is described in a separate text file called a dsm file (also used by the LogParsers). Example:
perl collect_corpus.pl -o data_dir\raw_corpus.txt -d mysystem.dsm source_root_dir
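The prompt-id-to-state lookup that collect_corpus.pl performs can be sketched as follows. This is a minimal illustration in Python, not the actual Perl code; the prompt ids shown are hypothetical and the real dsm file format is not reproduced here.

```python
# Hypothetical prompt-id -> dialogue-state mapping, as a dsm file might
# encode it (the actual dsm file format is not shown in this document).
DSM = {
    "query_departure_place": "first_query",
    "confirm_departure_place": "confirm",
}

def state_for_prompt(prompt_id):
    """Look up the dialogue state associated with a system prompt id."""
    return DSM.get(prompt_id, "unknown")

print(state_for_prompt("query_departure_place"))
# first_query
```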
- Use "label_classes.pl" to tag your corpus with class names (the same classes that the Language Model uses) and clean out all non-linguistic information (e.g. indications of feedback, noise, etc.). You need the class definition file of your language model (e.g. BusLine_LM.def). This script also outputs a list of all the words found and their frequency in the corpus. You can sort this list and use it to spot transcription errors (which usually occur only once or twice in the corpus). The "-e" flag lets you request that even empty utterances (e.g. only noise) be output (by default, they are not). Note that this script has some hard-coded text processing for the Let's Go! system. This shouldn't prevent it from working with other systems, but might not be optimal; it might be fixed in the future. Example:
perl label_classes.pl -var data_dir\BusLine_LM.def -voc data_dir\vocacount.txt -o data_dir\labeled_corpus.txt
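The core transformation label_classes.pl applies can be sketched as below. This is an illustrative Python sketch, not the actual Perl script: the class definitions are hard-coded here as a simple word-to-class dictionary, whereas the real script reads them from the LM .def file (whose format is not shown in this document).

```python
# Hypothetical class definitions: word -> class name. The real script
# reads these from the language model's .def file (e.g. BusLine_LM.def).
CLASS_DEFS = {
    "SIXTY_ONE_C": "Bus",
    "SIXTY_ONE_A": "Bus",
}

def label_classes(utterance):
    """Uppercase an utterance, drop the speaker marker and trailing
    punctuation, and wrap known class members in <class> tags."""
    text = utterance.strip()
    if text.startswith("U:"):          # speaker marker from transcription
        text = text[2:].strip()
    text = text.rstrip(".").upper()
    words = []
    for w in text.split():
        if w in CLASS_DEFS:
            words.append('<class name=%s>%s</class>' % (CLASS_DEFS[w], w))
        else:
            words.append(w)
    return " ".join(words)

print(label_classes("U: when is the next sixty_one_c."))
# WHEN IS THE NEXT <class name=Bus>SIXTY_ONE_C</class>
```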
- You can generate subsets of your corpus (e.g. by dialogue state, speaker gender...) using the "make_subset.pl" script. This script takes inclusion or exclusion rules (based on regular expressions) such as language=\"North American English\". It can also take a control file that contains a list of session ids <date><id> (e.g. 20031110003), one per line, and only selects the utterances that are in that list. The output of this script is in the same format as the input (it's just a subset).
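The selection logic of make_subset.pl can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Perl implementation: inclusion rules are treated as regexes matched against the whole line, and the control list is matched against the dialog_id attribute.

```python
import re

# Two toy corpus lines in the <s ...> ... </s> format described below.
corpus = [
    '<s dialog_id="20031110003" state="first_query" gender="male"> WHEN IS THE NEXT BUS </s>',
    '<s dialog_id="20031111007" state="first_query" gender="female"> THE NEXT ONE </s>',
]

def subset(lines, include_rule=None, control_ids=None):
    """Keep lines matching an inclusion regex (e.g. gender="male")
    and/or whose dialog_id appears in a control list."""
    out = []
    for line in lines:
        if include_rule and not re.search(include_rule, line):
            continue
        if control_ids is not None:
            m = re.search(r'dialog_id="(\d+)"', line)
            if not m or m.group(1) not in control_ids:
                continue
        out.append(line)
    return out

print(subset(corpus, include_rule=r'gender="male"'))     # keeps line 1
print(subset(corpus, control_ids={"20031111007"}))       # keeps line 2
```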
- To generate the control file (e.g. for acoustic training) corresponding to a corpus file, use the script "make_ctlfile.pl". This script can optionally generate files for batch reprocessing of full sessions through a Sphinx-Phoenix-Helios pipeline.
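The essence of building a control file from a corpus file can be sketched as below. This is a Python illustration, not the Perl script; it assumes the Sphinx convention of listing one waveform path per line with the extension stripped, which may differ from what make_ctlfile.pl actually emits.

```python
import re

def make_ctl(lines):
    """Emit one waveform entry per utterance by pulling the raw_file
    attribute out of each <s> tag; the file extension is stripped,
    as Sphinx control files conventionally list paths without it."""
    entries = []
    for line in lines:
        m = re.search(r'raw_file="([^"]+)"', line)
        if m:
            entries.append(m.group(1).rsplit(".", 1)[0])
    return entries

line = ('<s dialog_id="20030219008" '
        'raw_file="\\phonelogs\\BusLine\\20030219\\008\\000.raw" '
        'state="first_query"> WHEN IS THE NEXT BUS </s>')
print(make_ctl([line]))
```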
- To generate the transcript file (e.g. for acoustic training or language modeling) corresponding to a corpus file, use the "make_transfile.pl" script.
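Extracting a transcript from a corpus file can be sketched as follows. Again this is an illustrative Python sketch, not the Perl script; the output format shown (words followed by the utterance id in parentheses) is the common Sphinx transcript convention and is assumed here, not confirmed by this document.

```python
import re

def make_trans(lines):
    """Strip the <s>/</s> wrapper and class markup from each corpus
    line, keeping only the words, and pair them with the dialog id
    (Sphinx-style "WORDS (id)" format, assumed for illustration)."""
    out = []
    for line in lines:
        m = re.search(r'<s [^>]*dialog_id="(\d+)"[^>]*>(.*)</s>', line)
        if not m:
            continue
        utt_id, body = m.group(1), m.group(2)
        body = re.sub(r'</?class[^>]*>', ' ', body)   # drop class tags
        out.append("%s (%s)" % (" ".join(body.split()), utt_id))
    return out

line = ('<s dialog_id="20030219008" state="first_query"> WHEN IS THE NEXT '
        '<class name=Bus>SIXTY_ONE_C</class> </s>')
print(make_trans([line]))
# ['WHEN IS THE NEXT SIXTY_ONE_C (20030219008)']
```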
All these scripts are self-documented (just type "perl <name_of_script>" without any arguments to see the usage message).
Format of a corpus file
Each line in a corpus file represents a tagged utterance. It starts with an <s> tag with appropriate properties and ends with a </s> tag. In between is the utterance, either as transcribed (after step 3, see above) or cleaned of non-linguistic tokens (e.g. noise) and marked up with class information (after step 4).
Example of an utterance (one line in the corpus file) after step 3:
<s dialog_id="20030219008" raw_file="\phonelogs\BusLine\20030219\008\000.raw" state="first_query" language="standard American" gender="male" familiarity="outside user" desync="NO" dont_use_for_training="NO"> U: when is the next sixty_one C. </s>
The same utterance after step 4:
<s dialog_id="20030219008" raw_file="\phonelogs\BusLine\20030219\008\000.raw" state="first_query" language="standard American" gender="male" familiarity="outside user" desync="NO" dont_use_for_training="NO"> WHEN IS THE NEXT <class name=Bus>SIXTY_ONE_C</class> </s>
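A corpus line like the ones above splits cleanly into a set of attributes and the utterance text. A minimal Python parsing sketch (not part of CorpusTools, purely illustrative):

```python
import re

def parse_utterance(line):
    """Split a corpus line into its <s> attributes and its text."""
    m = re.match(r'<s ([^>]*)>(.*)</s>\s*$', line)
    if not m:
        raise ValueError("not a corpus line")
    attrs = dict(re.findall(r'(\w+)="([^"]*)"', m.group(1)))
    return attrs, m.group(2).strip()

line = ('<s dialog_id="20030219008" state="first_query" gender="male" '
        'dont_use_for_training="NO"> WHEN IS THE NEXT '
        '<class name=Bus>SIXTY_ONE_C</class> </s>')
attrs, text = parse_utterance(line)
print(attrs["state"], attrs["gender"])
# first_query male
print(text)
# WHEN IS THE NEXT <class name=Bus>SIXTY_ONE_C</class>
```

Properties such as desync and dont_use_for_training make it easy to filter out utterances unsuitable for training without deleting them from the corpus.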