MakeLM

From Olympus

***MakeLM has been superseded/replaced by logios***

MakeLM is a Perl program that produces a trigram language model and a pronunciation dictionary from a Phoenix grammar. This can be used to bootstrap speech recognition before pronunciation and language model data are available.

Requirements

  • Perl is required, with the following modules installed: LWP::UserAgent, HTTP::Request::Common, File::Spec, File::Copy, IO::Handle, Getopt::Long, and IPC::Open2.
  • Cygwin is required.
  • Either a Perl script named cmp.pl or a Windows batch file named cmp.bat must exist in the Resources\Grammar directory. This script is run from within the Grammar directory and is expected to compile the grammar. If both scripts exist, the Perl script is run. The Perl script must return a true value.
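
The script-selection rule in the last requirement can be sketched as follows. This is an illustration in Python, not MakeLM's actual Perl code; the function name pick_compile_script is hypothetical.

```python
import os

def pick_compile_script(grammar_dir):
    """Return the path of the compile script to run in grammar_dir.

    Hypothetical helper that mirrors the documented rule only:
    cmp.pl is preferred, and cmp.bat is used when cmp.pl is absent.
    """
    for name in ("cmp.pl", "cmp.bat"):
        candidate = os.path.join(grammar_dir, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError("no cmp.pl or cmp.bat in " + grammar_dir)
```

Checking the two names in a fixed order is enough to express the "Perl script wins" rule without a separate branch for the case where both exist.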

Running

Although Cygwin is required, MakeLM may be run either from a Cygwin shell or from a DOS shell. It should be run from the MakeLM directory, since it makes assumptions about the directory structure.

DOS shell

MakeLM needs to know where cygwin1.dll is. If running from a DOS shell, and Cygwin is installed somewhere other than C:\cygwin, either

  1. the location of the Cygwin install needs to be specified in the CYGWIN_DIR environment variable, or
  2. cygwin1.dll needs to be copied into a directory on the DOS PATH (e.g. %SystemRoot%\system32).

Here's an example:

C:\MyProject\Tools\MakeLM>set CYGWIN_DIR=C:\mycygwin-install
C:\MyProject\Tools\MakeLM>perl makelm.pl
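
The lookup described above amounts to "use CYGWIN_DIR if it is set, otherwise fall back to C:\cygwin". A minimal sketch in Python, assuming exactly that rule; the function name cygwin_dir is hypothetical, and MakeLM's own Perl code may differ in detail.

```python
import os

# Documented default install location.
DEFAULT_CYGWIN_DIR = r"C:\cygwin"

def cygwin_dir(env=None):
    """Return the Cygwin install directory.

    Assumption: an unset or empty CYGWIN_DIR means "use the default".
    """
    env = os.environ if env is None else env
    return env.get("CYGWIN_DIR") or DEFAULT_CYGWIN_DIR
```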

Cygwin shell

Here's an example:

[/cygdrive/c/MyProject/Tools/MakeLM]$ ./makelm.pl

Options

| option | default | notes |
| --- | --- | --- |
| --resourcesdir | ../../Resources | the directory in which MakeLM looks for the Grammar directory |
| --samplesize | 30000 | the number of generated utterances used to build the language model |
| --source | internal lexdata | by default, dictionaries are built using the internal dictionary file ('CMUdict'), but you can specify an LMTool server (e.g. fife) |
| --projectname | the name of the directory two levels above the current directory | the root file name expected for the grammar file (projectname.gra), and the root file name assigned to the dictionary (projectname.dict) and language model (projectnameLM.arpa) output files |
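
To make the --projectname default concrete, here is a small Python sketch of how the name and the derived file names could be computed. The helper names are hypothetical and the code only restates the rules given above; it is not taken from makelm.pl.

```python
import posixpath

def default_project_name(cwd):
    """Name of the directory two levels above cwd (Cygwin-style path)."""
    two_up = posixpath.dirname(posixpath.dirname(cwd.rstrip("/")))
    return posixpath.basename(two_up)

def project_file_names(project):
    """File names derived from the project name, as documented."""
    return {
        "grammar": project + ".gra",            # expected input
        "dictionary": project + ".dict",        # output
        "language_model": project + "LM.arpa",  # output
    }
```

With the directory used in the Cygwin example, /cygdrive/c/MyProject/Tools/MakeLM, the default project name would be MyProject, so the grammar is expected in MyProject.gra.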

Details

MakeLM takes as input a Phoenix source grammar (.gra), and from that alone it produces three things:

  • a compiled Phoenix grammar (.net) for use by the Phoenix parser
  • pronunciation dictionaries (.dict and .dict.reduced_phoneset) for use by Sphinx
  • a tri-gram language model (.arpa) for use by Sphinx

The mechanism is as follows. The final products used at run time are the compiled grammar, the arpa language model, and the pronunciation dictionary:

  1. generic grammar + task grammar → project grammar
    Your task grammar is combined with the RavenClaw generic grammar, so your task grammar need not cover times, numbers, confirmations, et cetera. The combination amounts to simple concatenation.
  2. project grammar → compiled grammar + base vocabulary
    The Phoenix compile executable compiles this combined grammar. A base vocabulary is produced as a by-product.
  3. project grammar → pseudo-corpus
    The ISL (now interACT) generate_random_samples program generates a corpus of random utterances from the Phoenix grammar. The sampling behaves as if every expansion of a CFG rule were equally likely, regardless of how often it occurs in real usage. As a result, the language model may be unbalanced.
  4. base vocabulary → vocabulary
    The base vocabulary is cleaned up and formatted for use by the CMU-Cambridge SLM Toolkit.
  5. pseudo-corpus + vocabulary → tri-grams
    The CMU-Cambridge SLM Toolkit's text2idngram program counts tri-grams in the pseudo-corpus.
  6. tri-grams → arpa language model
    The CMU-Cambridge SLM Toolkit's idngram2lm program applies Good-Turing smoothing to the tri-gram counts and writes out an ARPA-formatted language model.
  7. vocabulary → pronunciation dictionary
    The CMUdict is used to produce a pronunciation dictionary.
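
Steps 3 and 5 above (pseudo-corpus generation and tri-gram counting) can be illustrated with a toy example. The Python sketch below is not the generate_random_samples or text2idngram implementation. The grammar is a made-up dictionary rather than Phoenix .gra syntax, and padding each utterance with <s> and </s> is an assumption made for the illustration.

```python
import random
from collections import Counter

# Toy grammar (hypothetical): each nonterminal maps to its alternative
# expansions; any symbol that is not a key is a terminal word.
TOY_GRAMMAR = {
    "S": [["i", "want", "CITY"], ["go", "to", "CITY", "please"]],
    "CITY": [["boston"], ["pittsburgh"], ["new", "york"]],
}

def generate(symbol, grammar, rng):
    """Expand symbol, choosing every alternative with equal probability."""
    if symbol not in grammar:
        return [symbol]
    words = []
    for part in rng.choice(grammar[symbol]):
        words.extend(generate(part, grammar, rng))
    return words

def count_trigrams(corpus):
    """Count word tri-grams over utterances padded with <s> and </s>."""
    counts = Counter()
    for utterance in corpus:
        padded = ["<s>"] + utterance + ["</s>"]
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    return counts

rng = random.Random(0)  # fixed seed so the sketch is repeatable
corpus = [generate("S", TOY_GRAMMAR, rng) for _ in range(1000)]
trigrams = count_trigrams(corpus)
```

Because each alternative is drawn uniformly, the two S expansions each account for roughly half of the pseudo-corpus whatever their real-world frequency, which is the source of the imbalance mentioned in step 3.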