Archive for the ‘research’ Category

Linear Addition from the Log Domain

February 24, 2013

Some speech recognizers and machine learning algorithms need to quickly calculate the quantity:

\log(x+y)

when given only \log(x) and \log(y). The straightforward algorithm first uses the exponential function to convert the log values to the linear domain representation x and y, then performs the sum, and finally uses the log function to convert back to the log domain:

\log(x+y) = \log(\exp(\log(x)) + \exp(\log(y)))

These conversions between the log and linear domains are slow, and problems arise when x or y is too large or too small for a machine’s floating point representation.

Luckily, there is a clever approximation method that allows quick, accurate calculation with a relatively small precomputed table.
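As a rough illustration, here is a minimal Python sketch of the idea. It relies on the identity \log(x+y) = \log(x) + \log(1 + \exp(\log(y) - \log(x))) for x ≥ y, so the only quantity to compute is a correction term that depends on the difference of the two log values. The function names, the table step of 0.01, and the cutoff of 20 are assumptions made for this example; the table lookup shown here is one plausible scheme and not necessarily the exact method described in the full post.

```python
import math

def log_add(log_x, log_y):
    """Return log(x + y) given log(x) and log(y), staying in the log domain."""
    if log_x < log_y:
        log_x, log_y = log_y, log_x          # make log_x the larger value
    if log_y == -math.inf:                   # y == 0 contributes nothing
        return log_x
    return log_x + math.log1p(math.exp(log_y - log_x))

def build_table(step=0.01, max_diff=20.0):
    """Precompute log(1 + exp(-d)) for d = 0, step, 2*step, ... (assumed layout)."""
    n = int(max_diff / step) + 1
    return [math.log1p(math.exp(-i * step)) for i in range(n)]

def log_add_table(log_x, log_y, table, step=0.01):
    """Approximate log(x + y) using a nearest-entry lookup for the correction."""
    hi, lo = max(log_x, log_y), min(log_x, log_y)
    diff = hi - lo
    limit = (len(table) - 1) * step
    if not diff < limit:                     # correction is negligible (or diff is inf/nan)
        return hi
    return hi + table[int(round(diff / step))]
```

With these values both functions handle inputs such as -1000 and -1001, where exp() would underflow to zero in the straightforward approach. A finer step or interpolation between entries would trade table size for accuracy.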

Bash one-liner using sox to batch convert the sampling frequency of audio files

July 12, 2011

A bash one-liner to batch convert the sampling rate of WAV files using the SoX tool.  The example will resample *.wav files in the current directory to 8000Hz and place the output in an existing subdirectory called 8000Hz.

The one-liner below is overkill for this task, but the extra arguments provide a starting point for modification for related tasks.  The use of find/xargs should help the one-liner deal with very large numbers of audio files and filenames that contain whitespace.

Code


find . -maxdepth 1 -name '*.wav' -type f -print0 | xargs -0 -t -r -I {} sox {} -r 8000 8000Hz/{}
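For related tasks that outgrow a one-liner, the same job can be sketched in Python. This is only a rough equivalent, assuming sox is on the PATH and the output directory already exists; the function names `build_sox_command` and `resample_all` are made up for this example. Passing the arguments as a list avoids shell quoting problems with filenames that contain whitespace.

```python
import subprocess
from pathlib import Path

def build_sox_command(src, out_dir, rate=8000):
    """Build the argument list for: sox SRC -r RATE OUT_DIR/SRC_NAME"""
    src = Path(src)
    return ["sox", str(src), "-r", str(rate), str(Path(out_dir) / src.name)]

def resample_all(directory=".", out_dir="8000Hz", rate=8000):
    """Resample every regular *.wav file in `directory` (not recursive)."""
    for src in sorted(Path(directory).glob("*.wav")):
        if src.is_file():
            # check=True stops at the first file that sox fails to convert
            subprocess.run(build_sox_command(src, out_dir, rate), check=True)
```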

Sphinx 4 Acoustic Model Adaptation

July 1, 2011

This is a writeup of the steps I took to adapt an acoustic model for use in Sphinx 4.  I followed the well-written CMU howto, performed all steps on a mostly-new Ubuntu 11.04 install, and adapted the Communicator acoustic model.  Keep an eye out for paths that may be different on your system and for any error messages that pop up when running these commands.

I also generated a new, full set of adaptation prompt data from the CMU ARCTIC prompts.

Wiring a Shure PG30 Headset Microphone

March 15, 2011

I recently tried to rewire a Shure PG30 headset microphone so that it could be connected directly to a computer sound card via a stereo mini plug.  I had limited, but insufficient, success after some amateur tinkering.

Situated Language Processing

February 24, 2011

One humanoid robot in a factory is about to be crushed by a falling box; the other is yelling, "Look Out!"

Second Order Cone Programming with CVXOPT

December 18, 2010

CVXOPT is a convex optimization package for Python that includes a Second Order Cone Programming (SOCP) solver.  The SOCP solver takes a set of matrices that describe the SOCP problem, but these matrices are different from the matrices usually used to express the SOCP problem.  This post walks through the simple algebra steps to find the relationship between the two formulations of the SOCP problem.
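As a hedged sketch of that relationship, assume the usual textbook form of a single cone constraint, ||A x + b||_2 ≤ c^T x + d. CVXOPT instead asks for matrices G and h such that the slack s = h - G x satisfies s[0] ≥ ||s[1:]||_2. Matching terms gives h = [d; b] and G = [-c^T; -A]. The helper names below are invented for this example, plain Python lists stand in for `cvxopt.matrix` objects, and the `c` here is the constraint vector, not the objective vector that `cvxopt.solvers.socp` also calls `c`.

```python
def socp_constraint_to_cvxopt(A, b, c, d):
    """Map ||A x + b||_2 <= c^T x + d to (G, h) with s = h - G x, s[0] >= ||s[1:]||_2."""
    G = [[-cj for cj in c]] + [[-aij for aij in row] for row in A]
    h = [d] + list(b)
    return G, h

def slack(G, h, x):
    """Compute s = h - G x row by row."""
    return [hi - sum(gij * xj for gij, xj in zip(row, x)) for row, hi in zip(G, h)]

# Check on a small made-up instance: s[0] should equal c^T x + d,
# and s[1:] should equal A x + b.
A = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -1.0]
c = [2.0, 1.0]
d = 3.0
x = [1.0, -2.0]
G, h = socp_constraint_to_cvxopt(A, b, c, d)
print(slack(G, h, x))  # → [3.0, -2.5, -6.0]
```

To actually solve a problem, each (G, h) pair would be converted to `cvxopt.matrix` objects and passed in the `Gq` and `hq` lists of `cvxopt.solvers.socp`; that conversion is left out of this sketch.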

An anagram of “Speech Recognition” is “Incoherence Spigot”

November 27, 2010

That is all.

What is the best sequence to tighten the lug nuts on a wheel with N nuts?

November 10, 2010

Nuts are numbered [0,N) as you travel around the circle, and we’d like to output the list of nut indexes to tighten, in order. For N=4, we’d tighten [0,2,1,3], although there are many equivalent solutions due to the inherent symmetry. For N=5, we have [0,2,4,1,3].

This problem actually came up when I was trying to pick a deterministic “maximally distant sequence” of colors for a visualization that I was doing. There are answers good enough (or better) for my visualization problem, but I feel like the question is an interesting one. One thing that probably needs doing is defining the objective function (i.e., “maximally distant”) in a more formal way.
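One candidate reading of “maximally distant”, offered only as a sketch and not as a settled objective, is a greedy rule: always move to the untightened nut that is farthest (around the circle) from the nut just tightened, breaking ties by lowest index. The function names below are made up for this example. The rule happens to reproduce the N=4 and N=5 sequences above, but nothing here shows it is optimal for other N or for the color-picking problem.

```python
def circular_distance(i, j, n):
    """Number of steps between nuts i and j going the short way around."""
    d = abs(i - j)
    return min(d, n - d)

def tightening_order(n):
    """Greedy order: start at nut 0, then repeatedly jump to the farthest remaining nut."""
    if n <= 0:
        return []
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        last = order[-1]
        # Farthest from the last nut first; lowest index breaks ties.
        nxt = min(remaining, key=lambda i: (-circular_distance(i, last, n), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(tightening_order(4))  # → [0, 2, 1, 3]
print(tightening_order(5))  # → [0, 2, 4, 1, 3]
```

A different objective, such as maximizing the minimum distance to every nut tightened so far, would give different sequences, which is part of why the definition matters.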

This post is intended to shame my future self into looking at it again, one day!

Speech Recognition Performance at NIST

November 5, 2010

The folks at NIST have plotted speech recognition competition performance over time on various datasets.  If you squint at the log-scale y-axis, it looks like we’re not making as much steady progress on the really complicated (i.e., real-life) datasets as one might hope.  On the other hand, this graph lists only NIST evaluations, which seem to have focused primarily on one particularly challenging type of speech for the past few years: meeting speech.

How to compile SRILM on Ubuntu

November 3, 2010

EDIT: This is my original post, but the comments have newer and better instructions.

I always encounter problems when compiling SRILM on Ubuntu.  Assuming the basic SRILM dependencies are installed on your system (see the Prerequisites), this works for SRILM 1.5.11 on Ubuntu 9.04 (Jaunty) and 10.04 (Lucid):

  1. Install tcsh if not already installed
  2. Install all the TCL developer libraries: tcl8.4-dev, tcl-dev, tcl-lib, tclx8.4, tclx8.4-dev.  This step may not be necessary; let me know what works for you.
  3. Uncomment the “SRILM =” line in the top level Makefile and replace the existing path with the absolute path of the SRILM top-level directory on your system (where the Makefile resides)
  4. Start the tcsh shell
  5. Type “make NO_TCL=X MACHINE_TYPE=i686-gcc4 World >& make.log.txt” to begin the build and capture both stdout and stderr in a file
  6. If you can run “./bin/i686-gcc4/ngram-count -help”, the build was probably a success

Please add simplifications to this recipe or extensions to other versions of Ubuntu in the comments.