Sphinx 4 Acoustic Model Adaptation

This is a writeup of the steps I took to perform acoustic model adaptation for use with Sphinx 4.  I followed the well-written CMU howto, performed all steps on a mostly-new Ubuntu 11.04 install, and adapted the Communicator acoustic model for use in Sphinx 4.  Keep an eye out for paths that may differ on your system and for any error messages that pop up when running these commands.

I also generated a new, full set of adaptation prompt data from the CMU ARCTIC prompts.

Build SphinxBase

First, download and build SphinxBase.  Since I had a relatively new Ubuntu install, I first had to sudo apt-get install autoconf libtool bison.

svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinxbase
cd sphinxbase
./autogen.sh
make
make check
sudo make install
sudo ldconfig -v

Thank you to Mnemonic Place for figuring out the last step (sudo ldconfig).  Without it, you’ll see errors when trying to run the sphinx_fe command later.

Build SphinxTrain

svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/SphinxTrain
cd SphinxTrain
./configure
make

Record Adaptation Training Data

Record or obtain the 20 Arctic WAV files that are used in the CMU howto. Because the Communicator acoustic model is 8 kHz, I recorded the WAVs as 8 kHz, 16-bit in Audacity and used the “Export Multiple” function to batch-save. I then renamed them to match the format the howto expects, e.g., “arctic_0001.wav”.  (Not quite as much fun as the Dave Barry passage that you get to read during Dragon’s adaptation process :] )
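For completeness, here’s a rough sketch of that batch rename. The rec-N.wav names are hypothetical stand-ins for whatever Audacity’s export produced, and a few empty demo files are created so the loop has something to act on:

```shell
# Batch-rename exported recordings to the arctic_0001.wav scheme the howto expects.
# The rec-N.wav names are made up; substitute your real export names.
mkdir -p rename_demo && cd rename_demo
for i in 1 2 3; do : > "rec-$i.wav"; done   # empty stand-ins for real recordings

n=1
for f in rec-*.wav; do
  mv "$f" "$(printf 'arctic_%04d.wav' "$n")"
  n=$((n + 1))
done
ls
```

The printf '%04d' format gives the zero-padded numbering that later file lists rely on.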

Miscellaneous File Downloading/Arranging

Download the four howto training files from the CMU wiki page.  One of the files is listed as “arctic20.fileids” but saves as “arctic20.listoffiles”; I renamed it to “arctic20.fileids” to avoid confusion with later scripts and instructions.  If you need more than 20 prompts for adaptation, you may want to draw them from the full set of CMU ARCTIC prompts I generated.
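If you end up recording a different number of prompts, the fileids list is easy to regenerate yourself. A sketch for the 20-file case (one basename per line, no .wav extension):

```shell
# Regenerate arctic20.fileids to match the arctic_0001 ... arctic_0020 naming.
for i in $(seq 1 20); do
  printf 'arctic_%04d\n' "$i"
done > arctic20.fileids
wc -l < arctic20.fileids
```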

Now move the four training files, your 20 arctic WAV files, and the acoustic model you’re going to use into one directory.  My acoustic model is in a subdirectory named Communicator_40.cd_cont_4000/.  In my case, that one directory is a sibling of the sphinxbase and SphinxTrain directories.

Generate Features

sphinx_fe -argfile Communicator_40.cd_cont_4000/feat.params \
   -samprate 8000 -c arctic20.fileids -di . -do . \
   -ei wav -eo mfc -mswav yes

Gather Statistics

First copy over binaries from SphinxTrain:

cp ../SphinxTrain/bin.i686-pc-linux-gnu/bw .
cp ../SphinxTrain/bin.i686-pc-linux-gnu/map_adapt .
cp ../SphinxTrain/bin.i686-pc-linux-gnu/mk_s2sendump .

Find a copy of the “fillerdict” you use and copy it into the acoustic model directory, renaming it to “noisedict”.

Run the Baum-Welch program:

./bw \
   -hmmdir Communicator_40.cd_cont_4000 \
   -moddeffn Communicator_40.cd_cont_4000/mdef \
   -ts2cbfn .cont. \
   -feat 1s_c_d_dd \
   -cmn current \
   -agc none \
   -dictfn arctic20.dic \
   -ctlfn arctic20.fileids \
   -lsnfn arctic20.transcription \
   -accumdir .

Perform MLLR training

Note that MLLR isn’t supported in Sphinx 4, so you can skip this step. “The Hieroglyph” document points out that you can iteratively build the mllr_matrix by running bw, then mllr_solve, then bw, then mllr_solve, etc. If you’re using a system that can take advantage of the mllr_matrix, run bw once more after you have your final mllr_matrix to generate MLLR-adapted statistics for the MAP adaptation described later.

cp ../SphinxTrain/bin.i686-pc-linux-gnu/mllr_solve .
./mllr_solve \
   -meanfn Communicator_40.cd_cont_4000/means \
   -varfn Communicator_40.cd_cont_4000/variances \
   -outmllrfn mllr_matrix -accumdir .

When the mllr_matrix exists, you can re-calculate statistics as mentioned above with the command:

./bw \
   -hmmdir Communicator_40.cd_cont_4000 \
   -moddeffn Communicator_40.cd_cont_4000/mdef \
   -ts2cbfn .cont. \
   -feat 1s_c_d_dd \
   -cmn current \
   -agc none \
   -mllrmat mllr_matrix \
   -dictfn arctic20.dic \
   -ctlfn arctic20.fileids \
   -lsnfn arctic20.transcription \
   -accumdir .

Perform MAP training

Sphinx 4 can benefit from MAP training, but it is labor-intensive at the moment. MAP training requires accurate transcripts of the adaptation prompts, and it may degrade performance if the wrong dictionary pronunciation is assumed for an audio recording. For example, there are two pronunciations for “a” in cmudict: “uh” and “ay”. Unless the transcription is annotated with the correct variant (“A” or “A(2)”), MAP training can degrade the acoustic models involving that phoneme.

The suggested solution to this problem is either to perform forced alignment or to hand-transcribe the data. I haven’t tried forced alignment yet. Human annotators could use a file included in my version of the ARCTIC prompts: it lists all alternative pronunciations inline in the transcription file. This is slow-going and error-prone, but it may be worthwhile if you’re adapting an acoustic model for personal use.
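As a sketch of what such an annotation looks like, an annotated line in the transcription file might read as follows (the sentence and fileid here are made up; the “(2)” suffix selects cmudict’s second pronunciation of “a”):

```
<s> this is A(2) short test sentence </s> (arctic_0099)
```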

In any case, the commands to perform the MAP adaptation are below.

cp ../SphinxTrain/bin.i686-pc-linux-gnu/map_adapt .
cp -r Communicator_40.cd_cont_4000/ Communicator_40.cd_cont_4000.adapted
./map_adapt \
    -meanfn Communicator_40.cd_cont_4000/means \
    -varfn Communicator_40.cd_cont_4000/variances \
    -mixwfn Communicator_40.cd_cont_4000/mixture_weights \
    -tmatfn Communicator_40.cd_cont_4000/transition_matrices \
    -accumdir . \
    -mapmeanfn Communicator_40.cd_cont_4000.adapted/means \
    -mapvarfn Communicator_40.cd_cont_4000.adapted/variances \
    -mapmixwfn Communicator_40.cd_cont_4000.adapted/mixture_weights \
    -maptmatfn Communicator_40.cd_cont_4000.adapted/transition_matrices


And that’s it! All of these commands should terminate within about one second.
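As a quick sanity check (just a sketch, assuming you’re still in the working directory with both model subdirectories present), you can confirm that the adapted parameter files actually differ from the originals:

```shell
# If map_adapt did anything, the adapted means should differ from the originals.
if cmp -s Communicator_40.cd_cont_4000/means \
         Communicator_40.cd_cont_4000.adapted/means; then
  echo "means are identical (adaptation may have had no effect)"
else
  echo "means differ"
fi
```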


To test whether these commands actually did anything, I generated some new/old recognition results.  The test data were the WAV files used for adaptation, and I used the 5K NVP 3-gram ARPA language model available from Keith Vertanen’s site.  The acoustic models were the original Communicator model and the MAP-adapted model (I didn’t use the MLLR transform).

This doesn’t tell us anything definitive about the performance of the acoustic model in our application, but it does show that the adaptation did something. Below are the first few recognition results: the first line uses the original acoustic model (OLD), the second line uses the MAP-adapted acoustic model (NEW), and the third line is the gold-standard transcription (GLD).

OLD: all circuit egypt around philip still successor
NEW: author of the danger trail philip feels it set or a
GLD: author of the danger trail, philip steels et cetera

OLD: not efficiency killer case tom unfair work
NEW: not at this particular case thomas politics with more
GLD: not at this particular case tom apologized whittemore

OLD: further twentieth time at evening into mexican
NEW: for the twentieth time that evening the two men sugar
GLD: for the twentieth time that evening the two men shook hands

Other Resources

A document called “The Hieroglyph” talks a bit more about adaptation and makes a few good points about the number of utterances and the care with which they are transcribed. Some suggestions from that document were incorporated into this post.

Good luck!


10 Responses to “Sphinx 4 Acoustic Model Adaptation”

  1. Roger Says:

Hi there. I’m interested in adapting the standard WSJ acoustic model to be able to understand another language. It is my understanding that this is possible as long as there are not too many words to be recognized. What do you think about this? In all honesty, I don’t have the time nor resources to build a new acoustic model from scratch, and I only need this to recognize a few words, about 10-20.

And I have another question: is it possible to do this fully in Windows, without Cygwin to be precise? And if not, is it recommended to use a small Linux distribution like DamnSmallLinux for this process? I ask because my knowledge of Linux, its environment, and its commands is next to nil.

    Thank you.

    • romanows Says:

One way to accomplish this is to determine which English phoneme sequences best fit the words in your target language. I believe the general ideas have been discussed in the Sphinx 4 forums under “phonetic transcription” or “phoneme loop”: http://sourceforge.net/projects/cmusphinx/forums. I tried searching, but SourceForge seems to be having problems right now.

      The key to this approach is a “phoneme loop”, which tricks Sphinx into recognizing phoneme sequences instead of the normal word sequences. It can be implemented as a grammar language model where any phoneme can be followed by any other phoneme. In JSGF, this would look something like “S -> (_ah_ | _eh_ | _oh_ | … | _z_) <S>*”. Here, “_ah_” is a word in the dictionary that has the pronunciation “AH”.

      If you perform recognition using this phoneme loop language model, Sphinx will find a sequence of English phonemes that fit the words in your target language vocabulary. The output will hopefully look like ” _ah_ _t_ _dh_ …” (try playing with things like the silence insertion probabilities if your results are poor). Then, take these sequences and use them as pronunciations in a new dictionary: “NEWWORD AH T DH”.
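Concretely, the phoneme-loop dictionary maps each pseudo-word to its single phoneme, and the learned sequences then become ordinary entries in your new dictionary (NEWWORD and the exact pseudo-word spellings are just placeholders):

```
_ah_ AH
_t_ T
_dh_ DH
NEWWORD AH T DH
```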

      Since you have a small vocabulary, it might actually work. Good luck!

  2. Roger Says:

Thank you very much for suggesting this idea. It’s something that once crossed my mind, but I didn’t think it could work. Now that you’ve said it might, I’m convinced to try it. But I have a little problem making it work. I used the Sphinx-4 HelloWorld sample code and edited the grammar file with this :

    <S> = (AA| AE| AH| AO| AW| AY| B| CH| D| DH| EH| ER| EY| F| G| HH| IH| IY| JH| K| L| M| N| NG| OW| OY| P| R| S| SH| T| TH| UH| UW| V| W| Y| Z| ZH);

    public = ([<S>][<S>][<S>][<S>][<S>][<S>]);

Not exactly written like you suggested, but I think it could work. The problem is, the compile process takes FOREVER (I’m using Eclipse btw). I’ve changed -Xmx to 1024 and have been waiting for the last hour, but the program hasn’t run yet. I know this sounds noobish, and I am actually, but could you suggest another writing format that might be less resource-hungry while doing just the thing needed?

    edited by romanows: fixed formatting

    • romanows Says:

      Sphinx isn’t very smart about how it represents a JSGF grammar graph in memory. The following will work:

      #JSGF V1.0 utf-8 en;
      grammar phonemeLoop;

      public <WILDCARD> = <PHONEMES>+;
      <PHONEMES> = "<SIL>" | ah | n | s | ih | l | t | r | k | iy | d | m | er | z | eh | aa | ae | b | p | ow | f | g | ey | ay | ao | v | ng | uw | hh | w | sh | jh | y | ch | aw | th | uh | oy | dh | zh;

      And then your dictionary is just AH AH and so on. Sphinx happens to compile this in a smarter way than what you had originally written.

      As for whether or not this will work… that is a good question! With this approach, you’re relying on Sphinx to find the best word sequence given the audio and the acoustic/language models. The “best result” will be very bad in an absolute sense, but all you need is the correct ranking.

      One way this could fail is that, for multiple recordings of one word, you get a jumble of phonemes with no obvious structure. It is not a positive sign, although you should still try to continue and see whether one of those pronunciations will work. Start this process with a very small number of different-sounding words; that’ll let you fail or succeed quickly.

      Another way this might fail is if you’re trying to recognize the different words “ma”, “ma”, “ma”, “ma”, and “ma” in Chinese :) Hopefully, your target language isn’t tonal!

      The final way this might fail is through speaker dependence. You may find that the system works for the speakers used during this process, but not on new speakers. This might happen if the phoneme sequences you find don’t seem to “sound” at all like the real words in your target language.

      I’d really appreciate it if you could post a follow up comment when you’re done experimenting to say whether it worked or not, and what language you were trying to work with. Thanks!

      • Roger Says:

Thank you for going through all that trouble answering my questions. I used your JSGF format and, as you said, it runs, albeit a little slowly. I’m trying to make Sphinx understand the Indonesian language btw.

The problem is, although the results for the same word do have some obvious structure, they are not definite enough – in my opinion – to put those pronunciations in the dictionary. I’ve even tried some English words and got results quite different from the ones in the default dictionary. I don’t know if it’s because of my pronunciation or if the Sphinx recognizer just doesn’t work that way. I’m also afraid of what you said about this method being speaker-dependent, because I have a feeling (haven’t tried this either) that the results could differ if, say, a friend does the speaking.

Fortunately, after what you said about tonal languages, I did a little digging into the Indonesian language and figured out that it is in fact quite a simple language (compared to Chinese, for example), so I made my own pronunciations and wrote them directly into the dictionary. I tried to provide as many alternatives as possible. For example, for the word “MERIAM” – which means “CANNON” – I wrote eleven alternatives:

        MERIAM(2) M EY R IY AA M
        MERIAM(3) M EH R IY Y AA M
        MERIAM(4) M EY R IY Y AA M
        MERIAM(5) M ER IY AA M
        MERIAM(6) M ER R IY AA M
        MERIAM(7) M EH R IY Y AH M
        MERIAM(8) M EY R IY Y AH M
        MERIAM(9) M EH R IY EH M
        MERIAM(10) M EY R IY EH M
        MERIAM(11) M ER R IY Y EH M

And the result? It works wonders. Very accurate, with a low error rate. Of course, I have to really limit the grammar, but for what I needed (13 words exactly), this is more than enough.

So, that’s it. I know it’s not a scientific way to do this but hey, it worked. And I thank you very much for guiding me through this process. How glad I am to not have to go through the whole adaptation and training process for just a few words.

        Have a good day!

  3. Roger Says:

    Inside the empty bracket there are “”s. I don’t know why it doesn’t print. Is it HTML or something? God knows. Sorry for this.

  4. Roger Says:

I’m very sorry; it’s “less-than” “alphabet” “more-than”. Things inside those sharp brackets just don’t print!

    • romanows Says:

      I had a problem with that in my comment, too :) You can use the HTML escape codes &lt; for “less-than” and &gt; for “more-than”. I edited your comment to put these things back in.

  5. Himanhsu Says:

What should I do after adapting the acoustic model?
I tried to run this command after adaptation:
pocketsphinx_continuous -hmm -lm -dict -infile test.wav
but I am unable to find out what this command actually does.
Any help here??

    • romanows Says:

      No, sorry, I haven’t used pocketsphinx, so I wouldn’t have any insight into this command beyond the documentation that shows up when you google it. Good luck!
