Archive for the ‘research’ Category

Speech Recognition with CMU Sphinx and SRILM language models

October 19, 2009

Speech recognition that does not take advantage of a domain-specific language model produces silly results.  The recognition menagerie below was generated with Sphinx 4, using the HUB4 acoustic model and an SRILM language model built from the human annotators' transcripts.

  • No LM: thusly islam brown ivan and apelike of yellow
  • LM: a seated small brown ottoman and teapot of yellow
  • True: i see a small brown ottoman and a pot of yellow
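
One way to put a number on the difference between the outputs above is word error rate (WER): the word-level edit distance to the true transcript, divided by the number of words in the true transcript. The post does not report WER, so the sketch below is only an illustration; the function names are my own and not part of Sphinx or SRILM.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the hypothesis into the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance normalized by the reference length."""
    return word_edit_distance(reference, hypothesis) / len(reference.split())


truth = "i see a small brown ottoman and a pot of yellow"
outputs = {
    "No LM": "thusly islam brown ivan and apelike of yellow",
    "LM": "a seated small brown ottoman and teapot of yellow",
}

for label, hypothesis in outputs.items():
    print(f"{label}: WER = {word_error_rate(truth, hypothesis):.2f}")
```

A lower WER means the output is closer to the transcript, so the language model run should score better than the run without one.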


Using Xournal to Annotate PDFs

September 13, 2009

Just came across an open source tool that allows easy annotation of PDF files: Xournal. One can add text, sketches, and highlighting and then export the result to a PDF.

LaTeX Beamer Presentations

August 29, 2009

I recently tried LaTeX’s Beamer package to make a presentation. This presentation and the Beamer user guide (PDF) were great references.  I don’t plan on going back to PowerPoint.  The advantages of Beamer are:

  • Professional-looking styles without graphic design tweaking
  • Automatic styling makes it harder to create “busy” slides
  • Source is plain text, so it can be stored easily in a version control system
  • Math typesetting
  • Platform-independent slide format (PDF)

One thing I particularly liked was my two-bullet “key ideas” slide at the beginning of the presentation.  I also identified some areas for improvement, due in part to the academic setting of the presentation:

  • Restrict the main presentation to the key idea and explain secondary details in appendices.
  • Mention only those things you’d like to spend time explaining in detail.
  • Print a one-page handout with equations, especially when they are spread across multiple slides.
  • When an equation is introduced, explain it as soon as it becomes visible.
  • Try to introduce only one equation (or derivation) per slide.

Database is slow to process large datasets?

July 5, 2009

To learn Python and have a bit of extracurricular fun, I entered an NLP-related competition built around a large textual dataset (~4 GB).  The natural solution involved pounding on the problem with Postgres.  The first step was to create word counts for each document in the database.  I let a script run over the full dataset and, after a few hours, I was surprised to see that it had only just started to make a dent.
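
For readers unfamiliar with the task, per-document word counting can be sketched in a few lines of Python. This is not the script from this post, and it is not the change mentioned further on; the tokenizer pattern, the function names, and the in-memory dictionary of documents are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def count_words(text):
    """Return a Counter mapping each token in the text to its frequency."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


def count_words_per_document(documents):
    """Map each document id to the word counts of its text."""
    return {doc_id: count_words(text) for doc_id, text in documents.items()}


# Hypothetical two-document corpus standing in for rows read from a database.
documents = {
    1: "The cat saw the other cat.",
    2: "A small brown ottoman and a pot of yellow",
}

counts = count_words_per_document(documents)
print(counts[1].most_common(2))
```

In a real run the documents would be streamed from the database rather than held in a dictionary, and the counts written back in batches.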

After some refactoring, I kicked off the script and left for a short vacation.  When I returned, I was surprised to see that my solution to this simple problem (counting words in text documents) had taken 6.5 days to finish!  I was doing something very wrong.  By making one simple change, I could tokenize the entire dataset in only 4 hours…