I have an idea. I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R, via RCurl and RJSONIO (from http://www.omegahat.org/). Even though my data is physically split into months, I can use RCurl to pull from each of the databases, use RJSONIO to parse the JSON, and then use bootstrap methods to estimate the expected value and confidence bounds and, perhaps more importantly, to identify outliers and unusual events.
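To make the bootstrap idea concrete, here is a minimal sketch of a percentile bootstrap for the mean. It is written in Python as a stand-in rather than R, and everything in it (the function name `bootstrap_mean_ci`, the resample count, the sample values) is an assumption for illustration, not the actual analysis code.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap estimate of the mean with a (1 - alpha) interval.

    Hypothetical helper: `values` stands in for one detector's 5-minute
    aggregates after they have been fetched and parsed.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, keeping the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Made-up sample with one suspicious reading (50).
sample = [10, 12, 11, 13, 50, 12, 11, 10, 12, 11]
estimate, lower, upper = bootstrap_mean_ci(sample)
print(estimate, lower, upper)
```

Observations that fall far outside the resulting bounds would be candidates for the outliers and unusual events mentioned above, although the cutoff to use is a separate decision.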
Well, last Friday I posted “So, slotting 4 months of data away. I’ll check it again on Monday and see if it worked.”
It didn’t. Actually, I checked later that same day and all of my jobs had died due to recv errors. I’ve tried lots of hacky things, but nothing seems to do the trick. From some Google searching, it seems that it may be a timeout issue, but I can’t see how to modify the Perl library to allow for a longer timeout.
So, I wrote a little hackity hack thing to stop writing for 5 seconds, make a new connection, and go on writing. Now it only crashes out of the loop if that new connection also fails to write. I also don’t crash until I’ve saved my place in the CSV file, so I don’t repeat myself. So I’m not getting a complete failure, but it is still super slow.
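The pause, reconnect, and save-my-place logic can be sketched roughly like this. It is in Python rather than Perl, and the names `connect`, `put`, and `save_checkpoint` are hypothetical placeholders, not the Tokyo Tyrant client API.

```python
import time


def write_rows(rows, connect, start=0, pause_seconds=5,
               save_checkpoint=lambda index: None):
    """Write rows[start:], reconnecting once per failed write.

    `connect` is a hypothetical factory returning an object with a
    `put(row)` method that raises OSError on a network failure.
    """
    conn = connect()
    for index in range(start, len(rows)):
        try:
            conn.put(rows[index])
        except OSError:
            # First failure: pause, open a fresh connection, retry once.
            time.sleep(pause_seconds)
            conn = connect()
            try:
                conn.put(rows[index])
            except OSError:
                # Second failure: record the position in the CSV, then give up.
                save_checkpoint(index)
                raise
    save_checkpoint(len(rows))
    return len(rows)
```

On a restart, the saved index would be passed back in as `start`, so rows that were already written are not repeated.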
While the documentation for Tokyo Tyrant and Tokyo Cabinet is super great overall, it seems thin on use cases and examples for stuffing a lot of data into the table db at once.
Interesting, probably unrelated fact: the crashing only started when I recomputed my target bnum and boosted it from 8 million to 480 million.
Anyway, I had time today to tweak the data load script, and also to finalize my CouchDB loading script. I started two jobs for each, with Tokyo Tyrant started first, and it looks like CouchDB is going to finish first. (The CouchDB January job is completing three days of data for every one in the Tokyo Tyrant job; the March jobs are closer together, but that Tyrant job started about an hour before everything else.)
I guess there is still a way for Tokyo Tyrant to win this race. I am planning to set up a map/reduce type of view on my CouchDB datastore to collect hourly summaries of the data. It might be that computing that view is slow, and that computing similar summaries on the Tokyo Cabinet table is faster. We’ll see.
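As a rough picture of what such an hourly summary would compute, here is a Python sketch of the map and reduce steps. CouchDB views themselves are typically written in JavaScript, so this only mimics the logic, and the document fields `detector_id`, `ts`, and `volume` are assumptions about the data layout.

```python
from collections import defaultdict
from datetime import datetime


def map_doc(doc):
    """Map step: emit one (detector, hour) key per 5-minute document."""
    ts = datetime.fromisoformat(doc["ts"])
    yield (doc["detector_id"], ts.strftime("%Y-%m-%d %H:00")), doc["volume"]


def reduce_values(values):
    """Reduce step: collapse all values for one key into a summary."""
    return {"count": len(values), "sum": sum(values)}


def hourly_summaries(docs):
    """Group the mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


# Made-up documents: two readings in the 08:00 hour, one in the 09:00 hour.
docs = [
    {"detector_id": "d1", "ts": "2009-01-05T08:00:00", "volume": 12},
    {"detector_id": "d1", "ts": "2009-01-05T08:05:00", "volume": 15},
    {"detector_id": "d1", "ts": "2009-01-05T09:00:00", "volume": 7},
]
print(hourly_summaries(docs))
```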