MakeLM
***MakeLM has been superseded/replaced by logios***
MakeLM is a Perl program that produces a trigram language model and a pronunciation dictionary from a Phoenix grammar. This can be used to bootstrap speech recognition before pronunciation and language model data are available.
Requirements
- Perl is required, with the following modules installed: LWP::UserAgent, HTTP::Request::Common, File::Spec, File::Copy, IO::Handle, Getopt::Long, and IPC::Open2.
- Cygwin is required.
- Either a Perl script named cmp.pl or a Windows batch file named cmp.bat must exist in the Resources\Grammar directory. This script is run from within the Grammar directory and is expected to compile the grammar. If both scripts exist, the Perl script is run. The Perl script must return true.
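The script-selection rule above can be sketched as follows. This is only an illustration in Python, not MakeLM's own Perl code; the function name `pick_compile_script` is hypothetical.

```python
import os

def pick_compile_script(grammar_dir):
    """Return the compile script to run in grammar_dir.

    cmp.pl is checked first, so it wins when both scripts exist.
    (Hypothetical helper; MakeLM itself is written in Perl.)
    """
    for name in ("cmp.pl", "cmp.bat"):
        candidate = os.path.join(grammar_dir, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError("no cmp.pl or cmp.bat found in " + grammar_dir)
```

A caller would then run the returned script with the Grammar directory as the working directory and treat a failing result as a compile error.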
Running
Although Cygwin is required, MakeLM may be run either from a Cygwin shell or from a DOS shell. It should be run from the MakeLM directory, since it makes assumptions about the file structure.
DOS shell
MakeLM needs to know where cygwin1.dll is. If running from a DOS shell, and Cygwin is installed somewhere other than C:\cygwin, either
- the location of the Cygwin install needs to be specified in the CYGWIN_DIR environment variable, or
- cygwin1.dll needs to be copied somewhere into the DOS Path (e.g. %SYSTEM%\win32).
Here's an example:
C:\MyProject\Tools\MakeLM>set CYGWIN_DIR=C:\mycygwin-install
C:\MyProject\Tools\MakeLM>perl makelm.pl
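The lookup described above (CYGWIN_DIR if set, otherwise C:\cygwin, otherwise the DOS Path) might look roughly like the sketch below. It is an illustration only: the function name is hypothetical, and it assumes that cygwin1.dll sits in the `bin` subdirectory of the Cygwin install, as it does in a standard installation.

```python
import ntpath
import os

DEFAULT_CYGWIN_DIR = r"C:\cygwin"

def cygwin_dll_candidates(env=None):
    """List the places where cygwin1.dll would be looked for, in order.

    Assumption: the DLL lives in <install>\\bin, as in a standard Cygwin setup.
    """
    env = os.environ if env is None else env
    root = env.get("CYGWIN_DIR", DEFAULT_CYGWIN_DIR)
    candidates = [ntpath.join(root, "bin", "cygwin1.dll")]
    # Fall back to every directory on the DOS Path.
    for entry in env.get("PATH", "").split(";"):
        if entry:
            candidates.append(ntpath.join(entry, "cygwin1.dll"))
    return candidates
```

Passing the environment as a dictionary keeps the sketch testable without touching the real environment variables.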
Cygwin shell
Here's an example:
[/cygdrive/c/MyProject/Tools/MakeLM]$ ./makelm.pl
Options
| option | default | notes |
|---|---|---|
| --resourcesdir | ../../Resources | this is where we look for the Grammar directory |
| --samplesize | 30000 | this is the number of generated utterances that we use to build the language model |
| --source | internal lexdata | by default, dictionaries are built using the internal dictionary file ('CMUdict'), but you can specify an LMTool server (e.g. fife) |
| --projectname | the directory name 2 directories up from the current directory | |
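
To make the defaults in the table concrete, here is a minimal sketch of equivalent option handling. It uses Python's `argparse` purely for illustration (MakeLM itself uses Perl's Getopt::Long); the function names and the use of `None` to stand for the internal dictionary are assumptions.

```python
import argparse
import os

def default_project_name(cwd):
    """Name of the directory two levels above cwd."""
    two_up = os.path.dirname(os.path.dirname(os.path.normpath(cwd)))
    return os.path.basename(two_up)

def parse_options(argv, cwd):
    parser = argparse.ArgumentParser(prog="makelm")
    # Where the Grammar directory is looked up.
    parser.add_argument("--resourcesdir", default="../../Resources")
    # Number of generated utterances used to build the language model.
    parser.add_argument("--samplesize", type=int, default=30000)
    # None stands for the internal dictionary; otherwise an LMTool server name.
    parser.add_argument("--source", default=None)
    parser.add_argument("--projectname", default=default_project_name(cwd))
    return parser.parse_args(argv)
```

With the directory layout from the examples above, running from `MyProject/Tools/MakeLM` without `--projectname` would yield the project name `MyProject`.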
Details
MakeLM takes as input a Phoenix source grammar (.gra), and from that alone it produces three things:
- a compiled Phoenix grammar (.net) for use by the Phoenix parser
- pronunciation dictionaries (.dict and .dict.reduced_phoneset) for use by Sphinx
- a tri-gram language model (.arpa) for use by Sphinx
The mechanism is as follows. The final products used at run time are the compiled grammar, the arpa language model, and the pronunciation dictionary:
- generic grammar + task grammar → project grammar
- Your task grammar is combined with the RavenClaw generic grammar. This way you need not worry about grammars for times, numbers, confirmations, et cetera in your task grammar. The combination amounts to simple concatenation.
- project grammar → compiled grammar + base vocabulary
- The Phoenix compile executable compiles this combined grammar. A base vocabulary is produced as a by-product.
- project grammar → pseudo-corpus
- base vocabulary → vocabulary
- The base vocabulary is cleaned up and formatted for use by the CMU-Cambridge SLM Toolkit.
- pseudo-corpus + vocabulary → tri-grams
- The CMU-Cambridge SLM Toolkit's text2idngram program counts tri-grams in the pseudo-corpus.
- tri-grams → arpa language model
- The CMU-Cambridge SLM Toolkit's idngram2lm program applies Good-Turing smoothing to the tri-grams, and writes out an ARPA-formatted language model.
- vocabulary → pronunciation dictionary
- The CMUdict is used to produce a pronunciation dictionary.
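
The tri-gram counting step in the pipeline above can be illustrated with a short sketch. This is not the toolkit's implementation: it is a simplified Python stand-in that counts tri-grams one utterance at a time, and the `<s>`, `</s>` and `<UNK>` tokens are conventions assumed for the example.

```python
from collections import Counter

def count_trigrams(corpus_lines, vocabulary, unk="<UNK>"):
    """Count word tri-grams in a pseudo-corpus, one utterance per line.

    Words outside the vocabulary are mapped to the unk token, and each
    utterance is wrapped in sentence markers (assumed conventions).
    """
    vocab = set(vocabulary)
    counts = Counter()
    for line in corpus_lines:
        words = [w if w in vocab else unk for w in line.split()]
        tokens = ["<s>"] + words + ["</s>"]
        # Slide a window of three tokens across the utterance.
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts

corpus = ["i want a flight", "i want a room"]
trigrams = count_trigrams(corpus, ["i", "want", "a", "flight"])
```

Here `trigrams[("i", "want", "a")]` is 2, and the out-of-vocabulary word "room" is counted under `("want", "a", "<UNK>")`. Counts like these are what a smoothing step such as Good-Turing then turns into probabilities.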
