To learn Python and have a bit of extracurricular fun, I entered an NLP-related competition based around a large textual dataset (~4 GB). The natural solution involved pounding on the problem with Postgres. The first step was creating word counts for each document in the database. I let a script run over the full dataset and, after a few hours, found that it had barely made a dent.
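For context, the naive approach looks something like the sketch below: pull each document out of Postgres, split it into word tokens in Python, and tally them with a Counter. This is an illustrative reconstruction, not my original script; the connection string, table, and column names (documents, id, body) are placeholders rather than the competition's actual schema.

```python
import re
from collections import Counter

import psycopg2  # assumes the documents are stored in Postgres

# Placeholder connection string and schema -- not the competition's actual setup.
conn = psycopg2.connect("dbname=nlp_competition")

def word_counts(text):
    """Lowercase the text, split it into word tokens, and tally them."""
    return Counter(re.findall(r"\w+", text.lower()))

with conn, conn.cursor() as cur:
    cur.execute("SELECT id, body FROM documents")
    for doc_id, body in cur:
        counts = word_counts(body)
        # store `counts` for this document, e.g. one row per (doc_id, word, count)
```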
After some refactoring, I kicked off the script and left for a short vacation. When I returned, I was surprised to see that my solution to this simple problem, counting words in text documents, had taken 6.5 days to finish! I was doing something very wrong. With one simple change, I could tokenize the entire dataset in only 4 hours…