Tokyo Tyrant Throwing a Tantrum

Well, last Friday I posted “So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.”

It didn’t.  Actually, I checked later that same day, and all of my jobs had died due to recv errors.  I’ve tried lots of hacky things, but nothing seems to do the trick.  From some Google searching, it seems that perhaps it is a timeout issue, but I can’t see how to modify the Perl library to allow for a longer timeout.

So, I wrote a little hackity-hack thing that stops writing for 5 seconds, makes a new connection, and goes on writing.  Now the script only crashes out of the loop if that new connection also fails to write, and it doesn’t crash until I’ve saved my place in the CSV file, so I don’t repeat myself.  So I’m not getting a complete failure anymore, but it is still super slow.
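
For my own reference, the retry wrapper is roughly this shape (a minimal sketch, assuming the stock TokyoTyrant Perl binding; the host, port, and error handling are simplified):

```perl
use strict;
use warnings;
use TokyoTyrant;

# Try a put; on failure, wait 5 seconds, reconnect, and try exactly once
# more.  Return false if the retry also fails, so the caller can save its
# place in the CSV file before giving up.
sub put_with_retry {
    my ($rdb_ref, $host, $port, $pkey, $cols) = @_;
    return 1 if $$rdb_ref->put($pkey, $cols);

    warn 'put failed: ', $$rdb_ref->errmsg($$rdb_ref->ecode()), "\n";
    sleep 5;

    my $fresh = TokyoTyrant::RDBTBL->new();
    $fresh->open($host, $port) or return 0;   # reconnect failed too
    $$rdb_ref = $fresh;                       # swap in the new connection
    return $$rdb_ref->put($pkey, $cols);
}
```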

While the documentation for Tokyo Tyrant and Tokyo Cabinet is generally quite good, it is thin on use cases and examples for stuffing a lot of data into the table database at once.

Interesting, probably unrelated, fact: the crashing only started when I recomputed my target bnum and boosted it from 8 million to 480 million.

Anyway, I had time today to tweak the data load script, and also to finalize my CouchDB loading script.  Having started two jobs each, with the Tokyo Tyrant jobs started first, it looks like CouchDB is going to finish first (the January CouchDB job is completing three days for every one day the Tokyo Tyrant job completes; the March jobs are closer together, but that Tyrant job started about an hour before everything else).

I guess there is still a way for Tokyo Tyrant to win this race.  I am planning to set up a map/reduce type of view on my CouchDB datastore to collect hourly summaries of the data.  It might be that computing that view is slow, and that computing similar summaries on the Tokyo Cabinet table is faster.  We’ll see.
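
So I don’t forget later: the hourly-summary view will probably look something like the sketch below.  The database name, document fields, and summary metric are all assumptions about how my loader writes the records, not a final design.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json);

# A design document whose map keys every reading by [year, month, day, hour]
# and whose reduce is CouchDB's built-in _sum.  Query it with group_level=4
# to get per-hour totals (use _stats instead if averages are wanted).
my $ddoc = {
    _id   => '_design/summaries',
    views => {
        hourly => {
            map    => 'function (doc) { emit([doc.year, doc.month, doc.day, doc.hour], doc.value); }',
            reduce => '_sum',
        },
    },
};

my $res = HTTP::Tiny->new->put(
    'http://localhost:5984/sensordata/_design/summaries',
    {
        headers => { 'Content-Type' => 'application/json' },
        content => encode_json($ddoc),
    },
);
print "$res->{status} $res->{reason}\n";
```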


Tokyo Tyrant is cool

Just so I have a record of this later, some notes.

Setting up Tokyo Tyrant instances, one per month.  I expect about 4 million records a day, so that’s 120 million a month, so I set bnum to 480 million (four times the expected monthly record count), which seems insane, but worth a shot.
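
The launch script amounts to something like this (a sketch only; the ports, file names, and tuning-string syntax are from my reading of the ttserver docs, so double-check before trusting it):

```perl
use strict;
use warnings;

# One ttserver per month, each on its own port, each backed by a table
# database (.tct).  Tuning parameters ride along after '#' in the database
# name; bnum is the bucket count.
my $bnum   = 480_000_000;
my @months = ('01' .. '04');

for my $i (0 .. $#months) {
    my $port = 1978 + $i;
    my $db   = "sensors_2009$months[$i].tct#bnum=$bnum";
    my @cmd  = ('ttserver', '-port', $port, '-dmn',
                '-log', "tyrant_$months[$i].log", $db);
    system(@cmd) == 0
        or warn "failed to start ttserver for month $months[$i]\n";
}
```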

One thing I noticed: in shifting from one-day tests to a full-month populate, and with the bump of bnum from 8 million (2 times the 4 million daily records) to 480 million, there is a significant speed drop when populating the data from four simultaneous processes (one for each of 4 months).

There is write delay of course, and that may be all of it, since the files are big now.

Perhaps there is a benefit from wider tables, rather than one row per data record?  Like one row per hour of data per sensor, or one row per 5 minutes, etc?
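
If I try the wide-row idea, one row per sensor per hour might look roughly like this (pure speculation; the key format and column names don’t match anything the current loader does):

```perl
use strict;
use warnings;
use TokyoTyrant;

my $rdb = TokyoTyrant::RDBTBL->new();
$rdb->open('localhost', 1978) or die $rdb->errmsg($rdb->ecode());

# Primary key packs sensor, date, and hour; each 5-minute reading becomes
# a column named by its minute within the hour.
my $pkey = 'sensor42:2009-03-02:08';
my %cols = (
    '00' => '12.7',
    '05' => '12.9',
    '10' => '13.1',
    # ...one column per remaining 5-minute slot
);
$rdb->put($pkey, \%cols) or warn $rdb->errmsg($rdb->ecode());
$rdb->close();
```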

Also, as I wrapped up my initial one-day tests, I got some random crashes in my Perl script stuffing data in.  Not sure why.  Could be because I was tweaking parameters and stuff.

One final point: the size of one day of data in Tokyo Cabinet is about the same as the size of one day of data in CouchDB.  I was hoping for a much bigger size advantage (a smaller file).  The source data is about a 100 MB unzipped CSV file, and it balloons to 600 MB with bnum set at 8 million in a table database.  Of course, it isn’t strictly the same data… I am splitting the timestamp into parts so I can do more interesting queries without a lot of work (give me an average of data on Mondays in July; Tuesdays all year; 8 am to 9 am last Wednesday, etc.).
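
The splitting itself is the easy part; it’s roughly this (the column names are just illustrative):

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);
use POSIX qw(strftime);

# Break "2009-07-13 08:35:00" into query-friendly columns, including day
# of week, so "Mondays in July" becomes a cheap lookup instead of date math.
sub timestamp_cols {
    my ($ts) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $ts =~ /^(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})/
        or die "unparseable timestamp: $ts";
    my $epoch = timelocal($s, $mi, $h, $d, $mo - 1, $y);
    return (
        year  => $y,
        month => $mo,
        day   => $d,
        hour  => $h,
        min   => $mi,
        dow   => strftime('%w', localtime $epoch),   # 0 = Sunday
    );
}
```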

So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.

And by the way, I’m sure I’m not the best at this because I haven’t used it much, but it is orders of magnitude faster to use the COPY command via DBIx::Class to load CSV data into PostgreSQL.  Of course, I don’t want to have all of that data sitting in my relational database, but I’m just saying…
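
For the record, the COPY path just goes through the DBI handle underneath DBIx::Class; roughly this, with the schema class, table, and columns as placeholders:

```perl
use strict;
use warnings;
use MySchema;   # hypothetical DBIx::Class schema class

my $schema = MySchema->connect('dbi:Pg:dbname=sensors', 'user', 'pass');
my $dbh    = $schema->storage->dbh;

# Stream the CSV straight into PostgreSQL; DBD::Pg's pg_putcopydata and
# pg_putcopyend drive the COPY protocol.
open my $csv, '<', 'readings.csv' or die $!;
$dbh->do('COPY readings (sensor_id, taken_at, value) FROM STDIN WITH CSV');
while (my $line = <$csv>) {
    $dbh->pg_putcopydata($line);
}
$dbh->pg_putcopyend();
close $csv;
```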


Putting stuff away

Started testing out Tokyo Cabinet and Tokyo Tyrant last Friday, and got my initial test program running this morning.  The documentation is pretty good, but I’m still floundering about a little bit.  Not sure what parameters to pass to the B+ tree database file to make it work well for my data; not sure how to set up multiple databases for sharding; etc.  On the plus side, my Perl code that loads the data is running at about 50% CPU, so it is doing something rather than just waiting around for writes.  On the downside, now I have to write a small program to check on the progress of those writes to make sure that I am actually writing something!
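
The progress checker can be tiny; something like this, assuming the stock TokyoTyrant Perl binding and whatever host and port the loader is pointed at:

```perl
use strict;
use warnings;
use TokyoTyrant;

# Ask the server for its record count and file size, which is enough to
# see whether the loaders are actually landing writes.
my $rdb = TokyoTyrant::RDB->new();
$rdb->open('localhost', 1978) or die $rdb->errmsg($rdb->ecode());
printf "%d records, %d bytes\n", $rdb->rnum(), $rdb->size();
$rdb->close();
```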

Update: I am comparing storing in Tokyo Tyrant with storing in CouchDB.  CouchDB, it turns out, is faster for me out of the box because of the way Erlang takes advantage of the multi-core processor.  The Tokyo Tyrant server just maxes out one core, so my loading programs wait around for the server to process the data.  CouchDB, on the other hand, will use up lots more cores (I’ve seen the process go to about 400% in top).  So, loading a year of data with one data-reading process per month running simultaneously, Tokyo Tyrant is only up to day 6 of each month, while my CouchDB loader programs are all up to about day 14 of each month.

I’m sure there is a way to set up Tokyo Tyrant to use multiple CPUs, but I haven’t found it yet.