Database is slow to process large datasets?

July 5, 2009

To learn Python and have a bit of extracurricular fun, I entered an NLP-related competition built around a large textual dataset (~4 GB). The natural solution involved pounding on the problem with Postgres. The first step was to create word counts for each document in the database. I let a script run over the full dataset and, after a few hours, I was surprised to see that it had only just started to make a dent.
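
For a sense of what that first step looks like, here is a minimal sketch of per-document word counting in Python. The tokenizer pattern, the `count_words` helper, and the two sample documents are illustrative assumptions, not the script or data from the competition, and the sketch leaves Postgres out entirely.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def count_words(text):
    """Return a Counter mapping each token in `text` to its frequency."""
    return Counter(TOKEN_RE.findall(text.lower()))


# Hypothetical stand-in for rows fetched from a documents table: id -> body.
documents = {
    1: "The cat sat on the mat.",
    2: "A dog chased the cat.",
}

# One Counter per document, keyed by document id.
word_counts = {doc_id: count_words(body) for doc_id, body in documents.items()}

for doc_id, counts in word_counts.items():
    print(doc_id, counts.most_common(3))
```

Counting in memory like this is cheap for a single document; the cost in a real pipeline tends to come from how the counts are read from and written back to the database.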

After some refactoring, I kicked off the script and left for a short vacation. When I returned, I was surprised to see that my solution to this simple problem (counting words in text documents) had taken 6.5 days to finish! I was doing something very wrong. By making one simple change, I could tokenize the entire dataset in only 4 hours…