A Tale About Problems of Scale
I earned some new bragging rights this week and had fun doing it. I love computer science!
A tale about iteration, unexpectedly large data sets, and time...
About a month ago, a large file of data unexpectedly needed to be processed, using some pre-existing code written several years ago. This code happened to be written in Java, but this story could apply to most any commonly used programming language. It was a batch process, so speed wasn't of the utmost essence, but at the same time it shouldn't run on and on and one because other files also needed to be processed. The data set was 500K records, whereas the larger data sets normally are in the 20-30K range, so this was more than 10x the norm.
Everything seemed to be going fine. The file was read into a database and some processing had been done on it and the time had come to write out a result file. Along with results, the original programmer had wanted some statistics, so there was a quick iteration through all of the records to gather the statistics and then the results would be written into the file. It seemed straightforward and I didn't expect anything to go wrong. When things go wrong on files, my experience has been that they usually go wrong earlier on.
I waited for the file, and then waited some more. One hour passed, and then another. Where was the file?
Normally, this process will update a status table in the database to indicate what it is currently doing, but it hadn't updated that record since the file writing phase had supposedly begun. Starting to worry that it was hung or deadlocked, I set off to investigate. The CPUs on the server showed that something was going on and it certainly wasn't idle, but they were only 35-45% busy. The memory was not paging. I checked the database and couldn't find any sign of deadlock.
Being a true lover of code, I did the next obvious step from my point of view – I read the code. I found the relevant bit of code and started tracing through that file writer step-by-step. What I discovered was that the statistics gathering phase wasn't written in a particularly optimal way. The developer had probably never imagined that such a large file would ever be processed.
Loop de doop...
What had been coded was a single loop through the records that were stored for the original file. Each record would be read, then another lookup of related data was made, and then a third lookup to map a simple translation was made. All three queries were very simple and used primary keys or indexes, so individually, they all ran very fast. Each lookup resulted in some statistics being updated, and the mapping applied nice text labels to the numbers so that humans could make sense of them. The code was O(n), so not O(1), but not so bad for the usual case.
It took almost 4 hours for the statistics to be gathered for all 500K records, for a total of 1.5 million queries, which worked out to about 0.48ms for each set of three queries. For 25K, that would be about 12 minutes. Unless someone was watching carefully to see the progress, in normal batch processing, 12 minutes would not usually be enough to send up red flags and catch anyone's attention.
Of course, this discovery was just the beginning. Naturally it had to be fixed. That's where the events of this week come in to play. I had some time set aside to look at improving the code this week, so I immediately set about finding a way to let the database do the work. After a few hours of composing and testing queries, I had it! A single query that could gather all of the statistics on the 500K records including the text labels, and do it in 1 minute 41 seconds. That felt good, really good!


