Is your database slow to process large datasets?

To learn Python and have a bit of extracurricular fun, I entered an NLP-related competition that was based around a large textual dataset (~4 GB).  The natural solution involved pounding on the problem with Postgres.  The first step involved creating word counts for each document in the database.  I let a script run over the full dataset and, after a few hours, I was surprised to see that it had only just started to make a dent.
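
For a sense of scale, the per-document work was trivial: pull the tokens out of the text and count them.  Something along these lines, sketched with a stand-in regex rather than the tokenizer I actually used:

```python
import re
from collections import Counter

# Stand-in tokenizer -- not the "semi-complicated" regex from the real script.
TOKEN_RE = re.compile(r"[a-z']+", re.IGNORECASE)

def count_tokens(text):
    """Return a {token: count} mapping for a single document."""
    return Counter(match.lower() for match in TOKEN_RE.findall(text))

print(count_tokens("The cat sat on the mat, the cat did."))
# Counter({'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1, 'did': 1})
```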

After some refactoring, I kicked off the script and left for a short vacation.  When I returned, I was surprised to see that my solution to this simple problem, counting words in text documents, had taken 6.5 days to finish!  I was doing something very wrong.  By making one simple change, I could tokenize the entire dataset in only 4 hours…

Originally, I thought that the slowdown was most likely due to my ignorance of Python, the unprofessional-looking Python/Postgres driver I was using, the Python-to-Postgres communication overhead, the sub-optimal configuration of Postgres, or my semi-complicated tokenizing regular expression.  However, speeding up my Python code and the Python-to-Postgres communication didn’t make much of a dent in the total runtime.

Analysis of the script’s logs (extracted by munging them with Vim and then visualizing the timing trends in a spreadsheet) showed that the moving average of per-document processing times was increasing logarithmically.  This processing time should be constant as long as we’re not reading documents in order of size or complexity.  After some googling, I realized my problem was with the constraints in my beautiful database schema, designed to make Edgar F. Codd proud.  As I added the tokens from each document, the database engine would check that the foreign key and uniqueness constraints were satisfied.  As the number of tokens and the number of (document, token) tuples grew, the database was searching through larger and larger lists.
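
Roughly, the schema looked like the sketch below: a unique list of tokens and a (document, token) count table tied together with foreign keys.  The table and column names here are placeholders, and psycopg2 stands in for the driver; the commented constraints are the checks the engine was running for every inserted row.

```python
import psycopg2

conn = psycopg2.connect("dbname=nlp")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE documents (
            id   serial PRIMARY KEY,
            body text   NOT NULL
        );
        CREATE TABLE tokens (
            id    serial PRIMARY KEY,
            token text   NOT NULL UNIQUE                             -- checked on every insert
        );
        CREATE TABLE document_tokens (
            document_id integer NOT NULL REFERENCES documents (id),  -- FK lookup per row
            token_id    integer NOT NULL REFERENCES tokens (id),     -- FK lookup per row
            count       integer NOT NULL,
            UNIQUE (document_id, token_id)                           -- another lookup per row
        );
    """)
```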

Once I created the tables without constraints, loaded all my data, and then applied the constraints with “ALTER TABLE” commands, the word count finished relatively quickly.  Of course, loading the data into constrained tables from the start did help me catch some bugs more quickly than I would have by loading everything first and applying the constraints afterward.  The best general strategy is probably to apply constraints after ingest, but to first test, with the constraints in place, on a subset of the corpus that is large enough to be representative yet small enough to give quick debugging feedback.
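
Concretely, the change amounted to something like the sketch below (placeholder names and toy rows again, with psycopg2 standing in for the driver): create bare tables, bulk-load, then add the primary key, uniqueness, and foreign key constraints in one pass.  Postgres then validates each constraint once over the full table, which is far cheaper than performing the equivalent lookups row by row during ingest.

```python
import psycopg2

conn = psycopg2.connect("dbname=nlp")  # placeholder connection string
with conn, conn.cursor() as cur:
    # 1. Bare tables: nothing for the engine to verify on each insert.
    cur.execute("""
        CREATE TABLE tokens (id serial, token text NOT NULL);
        CREATE TABLE document_tokens (
            document_id integer NOT NULL,
            token_id    integer NOT NULL,
            count       integer NOT NULL
        );
    """)

    # 2. Ingest everything.  In practice these rows came from the tokenizer.
    cur.execute("INSERT INTO tokens (id, token) VALUES (1, 'the'), (2, 'cat')")
    cur.executemany(
        "INSERT INTO document_tokens (document_id, token_id, count) VALUES (%s, %s, %s)",
        [(1, 1, 3), (1, 2, 1)],  # toy (document_id, token_id, count) rows
    )

    # 3. Apply the constraints once, after the data is in place.
    #    (A foreign key back to a documents table would be added the same way.)
    cur.execute("""
        ALTER TABLE tokens ADD CONSTRAINT tokens_pk PRIMARY KEY (id);
        ALTER TABLE tokens ADD CONSTRAINT tokens_token_uniq UNIQUE (token);
        ALTER TABLE document_tokens
            ADD CONSTRAINT doc_tok_uniq UNIQUE (document_id, token_id),
            ADD CONSTRAINT doc_tok_token_fk FOREIGN KEY (token_id) REFERENCES tokens (id);
    """)
```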

tl;dr:

  1. database table constraints can severely slow the ingest and processing of large datasets
  2. database engines are not as smart as they sound in the textbooks
