Archive for the ‘speech’ Category

Linear Addition from the Log Domain

February 24, 2013

Some speech recognizers and machine learning algorithms need to quickly calculate the quantity:

\log(x + y)
when given only \log(x) and \log(y). The straightforward algorithm first uses the exponential function to convert the log values to the linear domain representation x and y, then performs the sum, and finally uses the log function to convert back to the log domain:

\log(x+y) = \log(\exp(\log(x)) + \exp(\log(y)))

These conversions between the log and linear domains are slow, and overflow or underflow occurs when x or y is too large or too small for the machine’s floating point representation.

Luckily, there is a clever approximation method that allows quick, accurate calculation with a relatively small precomputed table.  Letting a = \max(\log(x), \log(y)) and d = |\log(x) - \log(y)|, the identity \log(x+y) = a + \log(1 + \exp(-d)) shows that only the single-argument function \log(1 + \exp(-d)) needs to be tabulated, and it decays to zero so quickly that the table only has to cover a short range of d.
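As a rough sketch of the idea (the function names, the cutoff of 50, and the table step size below are my own illustrative choices, not taken from any particular recognizer), the log-domain addition can be computed directly or via a small precomputed table of \log(1 + \exp(-d)):

```python
import math

def log_add(log_x, log_y):
    """Compute log(x + y) given only log(x) and log(y), staying in the log domain."""
    a = max(log_x, log_y)
    d = abs(log_x - log_y)
    if d > 50.0:          # exp(-d) underflows; the smaller term is negligible
        return a
    return a + math.log1p(math.exp(-d))  # log1p(z) = log(1 + z), accurate for small z

# Table-based variant: precompute log1p(exp(-d)) on a grid of d values,
# then replace the exp/log1p calls with a nearest-neighbor lookup.
STEP = 0.001
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(int(50.0 / STEP) + 1)]

def log_add_table(log_x, log_y):
    a = max(log_x, log_y)
    d = abs(log_x - log_y)
    if d > 50.0:
        return a
    return a + TABLE[int(d / STEP + 0.5)]  # round d to the nearest grid point
```

For example, both functions recover \log(7) from \log(3) and \log(4), the table version to within the resolution of its grid.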



Bash one-liner using sox to batch convert the sampling frequency of audio files

July 12, 2011

A bash one-liner to batch convert the sampling rate of WAV files using the SoX tool.  The example will resample *.wav files in the current directory to 8000Hz and place the output in an existing subdirectory called 8000Hz.

The one-liner below is overkill for this task, but the extra arguments provide a starting point for modification for related tasks.  The use of find/xargs should help the one-liner deal with very large numbers of audio files and filenames that contain whitespace.


find . -maxdepth 1 -name '*.wav' -type f -print0 | xargs -0 -t -r -I {} sox {} -r 8000 8000Hz/{}


Sphinx 4 Acoustic Model Adaptation

July 1, 2011

This is a writeup of the steps I took to adapt an acoustic model for use in Sphinx 4.  I followed the well-written CMU howto.  I performed all steps on a mostly-new Ubuntu 11.04 install and adapted the Communicator acoustic model for use in Sphinx 4.  Keep an eye out for paths that may be different on your system and for any error messages that pop up when running these commands.

I also generated a new, full set of adaptation prompt data from the CMU ARCTIC prompts.


Wiring a Shure PG30 Headset Microphone

March 15, 2011

I recently tried to rewire a Shure PG30 headset microphone so that it could be connected directly to a computer sound card via a stereo mini plug.  After some amateur tinkering I had limited success, but ultimately not enough to get it working.


An anagram of “Speech Recognition” is “Incoherence Spigot”

November 27, 2010

That is all.

Speech Recognition Performance at NIST

November 5, 2010

The folks at NIST have plotted speech recognition competition performance over time on various datasets.  If you squint at the log-scale y-axis, it looks like we’re not making the steady progress on the really complicated (i.e., real-life) datasets that one might hope for.  On the other hand, this graph lists only NIST evaluations, which seem to have focused primarily on one particularly challenging type of speech for the past few years: meeting speech.

How to compile SRILM on Ubuntu

November 3, 2010

EDIT: This is my original post, but the comments have newer and better instructions.

I always encounter problems when compiling SRILM on Ubuntu.  Assuming the basic SRILM dependencies are installed on your system (see the Prerequisites), this works for SRILM 1.5.11 on Ubuntu 9.04 (Jaunty) and 10.04 (Lucid):

  1. Install tcsh if not already installed
  2. Install all the TCL developer libraries: tcl8.4-dev, tcl-dev, tcl-lib, tclx8.4, tclx8.4-dev.  This step may not be necessary; let me know what works for you.
  3. Uncomment the “SRILM =” line in the top level Makefile and replace the existing path with the absolute path of the SRILM top-level directory on your system (where the Makefile resides)
  4. Start the tcsh shell
  5. Type “make NO_TCL=X MACHINE_TYPE=i686-gcc4 World >& make.log.txt” to begin the build and capture stderr and stdout in a file
  6. If you can run “./bin/i686-gcc4/ngram-count -help”, the build was probably a success

Please add simplifications to this recipe or extensions to other versions of Ubuntu in the comments.

Speech Recognition with CMU Sphinx and SRILM language models

October 19, 2009

Speech recognition that does not take advantage of domain-specific language models produces silly results.  The recognition menagerie below was generated with Sphinx 4.  I used the HUB4 acoustic model and an SRILM language model generated from the human annotator transcripts.

  • No LM: thusly islam brown ivan and apelike of yellow
  • LM: a seated small brown ottoman and teapot of yellow
  • True: i see a small brown ottoman and a pot of yellow