I have an application that is taxing my PostgreSQL install, and I’ve been taking a whack at using CouchDB to solve it instead.
On the surface, it looks like a pretty good use case, but I’m having trouble getting it to move fast enough.
In a nutshell, I am storing the output of a multiple imputation process. At the moment my production system uses PostgreSQL for this. I store each imputation output, one record per row. I have about 360 million imputation stored this way.
Each imputation represents an estimate of conditions at a mainline freeway detector. That is done in R using the excellent Amelia package. While the imputation is done for all lanes at the site, because I am storing the data in a relational database with a schema, I decided to store one row per lane.
The problem with that approach is that I have to run a nested select statement to use the data. First I have to go through and collect all the lanes for a site for a given timestamp, and then average over the multiple imputations. Then with the average of the imputations (or whatever is relevant to the problem), I would then sum up (or average or weight) the results across times for the detector. So if I wanted to
Each imputation run computes an entire year, and saves the output to a CSV file. I would then parse that file and save to the database. My initial attempt to copy this to couchdb was a straight copy, with a slight twist. I just saved the entire year of imputation output to the database as a document.
Then when I next returned to it, I realized that CouchDB didn’t have the same schema restriction as PostgreSQL, so I could just save all of the lanes in a single row of data. So that was version 2. I saved one document per imputation per time stamp per detector, merging out the per lane part when loading up the database from the CSV dump. This is why I know I have 360 million odd documents in PostgreSQL…I generated 90 million or so documents in CouchDB, and my view was taking an age to generate even with the documents split between 4 different databases. I also ran into the “you’re not reducing fast enough” scold. The problem is that sometimes a site only has one or two imputations, while other sites have 4 and 5. But I really wanted to run the reduce step for convenience. I wanted to just get a document that summarized all of the multiple imputations, rather than forcing my app to do the summary work.
That said, it is only a handful of imputations per site, so in this case I think the reduce is an incredible waste of computation power. Cleaner design, but death by 90 million cuts.
I also decided to try to bulk up the documents a bit. Rather than one document per time stamp per detector per imputation, I am currently trying to store 100 imputations per document, with no attempt at all to order or sort those 100…I just slice them off the docs array and reformat things prior to calling bulk docs. My idea is to try to speed up the view generation by dividing by 100 the number of times that CouchDB has to serve up, parse, and return a JSON document.
However, this seems to be triggering another bug in CouchDB, but it is hard to detect. I have 4 processes sending data to the CouchDB server. 3 of them are on the same machine as the CouchDB server, and the other two are on two different machines, because I spread the imputation jobs around. Now CouchDB will just crash and restart periodically. I think my choice of 100 documents was perhaps too high, and the bulk docs server is eating up just a little bit too much RAM so that it eventually gets killed.
As soon as the uploading stops I will run my views and see how fast they go now with the bigger doc size and the lack of a reduce. I ran it on 1000 docs and it was reasonably fast, but it is hard to tell from that few documents.