Tokyo Tyrant Throwing a Tantrum

Well, last Friday I posted “So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.”

It didn’t.  Actually I checked later that same day and all of my jobs had died due to recv errors.  I’ve tried lots of hacky things but nothing seems to do the trick.  From some Google searching, it seems that perhaps it is a timeout issue, but I can’t see how to modify the perl library to allow for a longer timeout.

So, I wrote a little hackity hack thing to stop writing for 5 seconds, make a new connection, and go on writing.  Now it only crashes out of the loop if that new connector also fails to write.  And I also don’t crash until I save my place in the CSV file, so I don’t repeat myself.  So I’m not getting a complete failure, but it is still super slow.

While the documentation for Tokyo Tyrant and Tokyo Cabinet is super great, it seems to be thin on documentation and use cases/examples for stuffing a lot of data into the table db at once.

Interesting probably unrelated fact.  The crashing only started when I recomputed my target bnum, and boosted it from 8 million to 480 million.

Anyway, I had time today to tweak the data load script, and also to finalize my CouchDB loading script.  Having started two jobs each, and with tokyo tyrant started first, it looks like couchdb is going to finish first (The January job is running three days completed to every one in Tokyo Tyrant job;  the March jobs are closer together, but that Tyrant job started about an hour before everything else).

I guess there is still a way for Tokyo Tyrant to win this race.  I am planning to set up a map/reduce type of view on my CouchDB datastore to collect hourly summaries of the data.  It might be that computing that view is slow, and that computing similar summaries on the Tokyo Cabinet table is faster.  We’ll see.


Tokyo Tyrant is cool

Just to have a recollection of this later, some notes.

setting up tokyo tyrant instances, one per month.  I expect about 4 million records a day, so that is 120 million a month, so I set bnum to 480 million, which seems insane, but worth a shot

One thing I noticed was that in shifting from one day tests to one month populate, and with the bump up of bnum from 8 million (2 times 4 million) to 480 million, I’m noticing a significant speed drop on populating the data from four simultaneous processes (one for each of 4 months).

There is write delay of course, and that may be all of it, since the files are big now.

Perhaps there is a benefit from wider tables, rather than one row per data record?  Like one row per hour of data per sensor, or one row per 5 minutes, etc?

Also, as I wrapped up my initial one-day tests, I got some random crashes on my perl script stuffing data in.  Not sure why.  Could be because I was tweaking parameters and stuff.

One final point, the size of the one day of data in tokyo cabinet is about the same as the size of one day of data in couchdb.  I was hoping to get a much bigger size advantage (smaller file).  The source data is about 100M unzipped csv file, and it balloons to 600 M with bnum set at 8 million in a table database.  Of course, it isn’t strictly the same data… I am splitting the timestamp into parts so I can do more interesting queries without a lot of work (give me an average of data on Mondays in July; Tuesdays all year; 8 am to 9 am last Wednesday, etc.

So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.

And by the way, I’m sure I’m not the best at this because I haven’t used it much, but it is orders of magnitude faster to use the COPY command via DBIx::Class to load CSV data into PostgreSQL.  Of course, I don’t want to have all of that data sitting in my relational database, but I’m just saying…



Putting stuff away

Started testing out TokyoCabinet and TokyoTyrant last Friday, and got my initial test program running this morning.  The documentation is pretty good, but I’m still floundering about a little bit.  Not sure what parameters to pass to the b+ tree database file to make it work well for my data; not sure how to set up multiple databases for sharding; etc etc.  On the plus side, my Perl code that loads the data is running at about 50% CPU, so it is doing something rather than waiting around for writes.  On the down side, now I have to write a small program to check on the progress of those writes to make sure that I am actually writing something!

Update.  I am comparing storing in TokyoTyrant with storing in CouchDB.  CouchDB it turns out is faster for me out of the box because of the way Erlang takes advantage of the multi-core processor.  Tokyo Tyrant server just maxes out one core, and so my loading programs wait around for the server to process the data.  CouchDB, on the other hand, will use up lots more cores (I’ve seen the process go about 400% in top).  So loading a year of data with one data reading process per month simultaneously, TokyoTyrant is only up to day 6 of each month, while my CouchDB loader programs are all up to about day 14 in each month.

I’m sure there is a way to set up TokyoTyrant to use multiple CPUs, but I can’t find it yet.

Swinger is cool, Sammy looks cooler

Just tried out swinger.  It is cool.  But I can’t get authorization to work right using the trunk checkout of couch (0.11.blahblah_git).  Something to hack on

But I’m more interested in playing with Sammy.js.  The two application stack figures on the blog page (and in the Swinger slides) are interesting.  Take away the couchdb bit, add Sakai’s K2, and you’ve got a very similar picture.  Sure couchdb can serve the app with attachments to the _design doc, but that’s not the point.  The point is being able to stick documents into a db and then get them out again in interesting ways without having to bend over backwards on the server side.

But again, I have to play with it for a while and see what it can do.

Bootstrap in a view

Inspired by this post, I am playing around with implementing bootstrapping various statistics as a view in couchdb.  I am not a statistician, so my definition should not be used as gospel, but bootstrapping is a statistical method where one randomly samples from an observed set of data in order to determine some statistics, such as the mean or the median.  Most of the older sources I’ve read talk about using it for small to medium sized data sets, etc., and so the k samples are all of size n.  But I can’t do that—my input data is too big.  So I have to pick a smaller n.  So I’m going with 1,000 for starters, and repeat the draw 10,000 times.

(There’s probably a secondary bootstrap I can do there to decide on the optimal size of the bootstrap sample, but I’m not going to dive into that yet.) Continue reading

A lot of data is a lot of data

I can’t seem to get an efficient setup going for storing loop data in couchdb.  On the surface it seems pretty simple—every loop is independent of every other loop, so every observation can be a document.  But for this application this is more limiting than I first thought.  The problem is that after storing just a few days worth of data, the single couchdb database expands to 35GB.  I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.

Continue reading