Tokyo Tyrant is cool

Just to have a recollection of this later, some notes.

Setting up Tokyo Tyrant instances, one per month.  I expect about 4 million records a day, which is roughly 120 million a month, so I set bnum (the size of the bucket array) to 480 million, four times the expected record count.  That seems insane, but it is worth a shot.
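Just to make the bnum arithmetic explicit, here is a tiny sketch in Python.  The function name and the 30-day month are my own assumptions for illustration; the factor of 4 is the top of the 0.5x to 4x range that, as I read the Tokyo Cabinet docs, is suggested for the bucket array.

```python
def suggest_bnum(records_per_day, days=30, factor=4):
    """Return a bucket count: expected record count times a safety factor.

    Hypothetical helper, not part of Tokyo Cabinet or Tokyo Tyrant.
    """
    expected_records = records_per_day * days
    return expected_records * factor


# One month per instance, 4x headroom.
monthly_bnum = suggest_bnum(4_000_000)

# The earlier one-day tests used 2x the daily record count.
one_day_bnum = suggest_bnum(4_000_000, days=1, factor=2)

print(monthly_bnum, one_day_bnum)
```

The two calls give the 480 million and 8 million figures mentioned in these notes.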

One thing I noticed: in shifting from one-day tests to populating a full month, and with bnum bumped from 8 million (2 times 4 million) to 480 million, there is a significant speed drop when populating the data from four simultaneous processes (one for each of 4 months).

There is write delay of course, and that may be all of it, since the files are big now.

Perhaps there is a benefit from wider tables, rather than one row per data record?  Like one row per hour of data per sensor, or one row per 5 minutes, etc?
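To make the wider-table idea concrete, here is a rough sketch in Python of folding individual readings into one row per sensor per hour.  The key layout (sensor id plus hour) and the column names (minute and second within the hour) are assumptions of mine, not anything Tokyo Cabinet requires, and I have not measured whether this actually helps.

```python
from collections import defaultdict
from datetime import datetime


def widen(records):
    """Group (sensor_id, timestamp, value) readings into one row per sensor per hour.

    Row key:    "<sensor_id>:<YYYYMMDDHH>"
    Column key: "<MMSS>" within that hour
    Values are stored as strings, since table database columns are strings.
    """
    rows = defaultdict(dict)
    for sensor_id, ts, value in records:
        key = f"{sensor_id}:{ts:%Y%m%d%H}"
        column = f"{ts:%M%S}"
        rows[key][column] = str(value)
    return dict(rows)


readings = [
    ("s1", datetime(2009, 7, 6, 8, 0, 30), 12),
    ("s1", datetime(2009, 7, 6, 8, 1, 0), 15),
    ("s1", datetime(2009, 7, 6, 9, 0, 0), 9),
]
wide = widen(readings)
print(len(wide), "rows instead of", len(readings))
```

The trade-off is fewer, fatter records: fewer keys to hash, but each write to an existing hour has to rewrite a bigger row.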

Also, as I wrapped up my initial one-day tests, I got some random crashes in the Perl script that stuffs the data in.  Not sure why.  It could be because I was tweaking parameters while it ran.

One final point: the size of one day of data in Tokyo Cabinet is about the same as the size of one day of data in CouchDB.  I was hoping to get a much bigger size advantage (a smaller file).  The source data is about 100 MB as an unzipped CSV file, and it balloons to 600 MB with bnum set at 8 million in a table database.  Of course, it isn’t strictly the same data… I am splitting the timestamp into parts so I can do more interesting queries without a lot of work (give me an average of data on Mondays in July; Tuesdays all year; 8 am to 9 am last Wednesday, etc.).
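For the record, the timestamp splitting looks roughly like the following.  This is a Python sketch rather than my actual Perl loader; the column names, the timestamp format, and the little `average` helper are all made up for illustration, and the filtering here happens in plain Python rather than in a Tokyo Cabinet query.

```python
from datetime import datetime


def split_timestamp(ts_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Break a timestamp string into separately queryable string columns."""
    ts = datetime.strptime(ts_text, fmt)
    return {
        "year": str(ts.year),
        "month": str(ts.month),
        "day": str(ts.day),
        "dow": str(ts.isoweekday()),  # 1 = Monday ... 7 = Sunday
        "hour": str(ts.hour),
        "minute": str(ts.minute),
    }


def average(rows, **criteria):
    """Average the 'value' column over rows whose columns match every criterion."""
    matches = [
        float(row["value"])
        for row in rows
        if all(row.get(col) == wanted for col, wanted in criteria.items())
    ]
    return sum(matches) / len(matches) if matches else None


rows = []
for ts_text, value in [
    ("2009-07-06 08:15:00", 10),  # a Monday
    ("2009-07-13 08:45:00", 20),  # a Monday
    ("2009-07-07 08:15:00", 99),  # a Tuesday
]:
    row = split_timestamp(ts_text)
    row["value"] = str(value)
    rows.append(row)

mondays_in_july = average(rows, month="7", dow="1")
print(mondays_in_july)
```

Every extra column is stored with every record, which is presumably part of why the file is six times the size of the source CSV.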

So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.

And by the way, I’m sure I’m not the best at this because I haven’t used Tokyo Tyrant much, but it is orders of magnitude faster to load CSV data into PostgreSQL with the COPY command via DBIx::Class.  Of course, I don’t want to have all of that data sitting in my relational database, but I’m just saying…




Putting stuff away

Started testing out TokyoCabinet and TokyoTyrant last Friday, and got my initial test program running this morning.  The documentation is pretty good, but I’m still floundering about a little bit.  I’m not sure what parameters to pass to the B+ tree database file to make it work well for my data, not sure how to set up multiple databases for sharding, and so on.  On the plus side, my Perl code that loads the data is running at about 50% CPU, so it is doing something rather than waiting around for writes.  On the down side, now I have to write a small program to check on the progress of those writes to make sure that I am actually writing something!

Update.  I am comparing storing in TokyoTyrant with storing in CouchDB.  CouchDB, it turns out, is faster for me out of the box because of the way Erlang takes advantage of the multi-core processor.  The Tokyo Tyrant server just maxes out one core, so my loading programs wait around for the server to process the data.  CouchDB, on the other hand, will use a lot more cores (I’ve seen the process go to about 400% in top).  So, loading a year of data with one loader process per month running simultaneously, TokyoTyrant is only up to day 6 of each month, while my CouchDB loader programs are all up to about day 14 in each month.

I’m sure there is a way to set up TokyoTyrant to use multiple CPUs, but I can’t find it yet.

Related to someone getting more and more almost famous by the day

Well, my mother-in-law’s interview with Tavis Smiley still hasn’t been broadcast (perhaps they are saving it for February?), but she got a very good review from the Washington Times dated Oct 8, 2009.  Of course, the internet being the internet, it has totally fallen off the front page of the book review section and even the Military History section, but lives on in the hard disk cache in the sky.  If you google “Escaped Slaves and the Union Navy” you get right to the review page by Gordon Berg.

It is interesting to me that it takes a third book to start getting positive buzz that goes beyond friends and acquaintances.   While the topic helps a little bit in that with Obama in the White House people are taking a fresh look at black history in our nation, I don’t think that is entirely all of it.  Her book on “G.I. Nightingales” was also pretty good, and should have been just as popular, but didn’t get the buzz.  Nor is it just that after three books one’s writing is bound to improve.  Perhaps it is just that with three books reviewers are more likely to review a book, and the publisher is more likely to get more traction marketing the book.

Maybe the next book will be optioned by Hollywood, then we’ll really be related to somebody famous!

Or maybe I will write four books on transportation engineering and get a movie made.

Or maybe one of the girls will finally write the book with the title “The Moon is the Nightime Sun” that they’ve been on about since they were 5…

Using psql copy from DBIx::Class

I am loading up lots and lots of data, and need to track what is going on, but I really don’t need all of the stuff that DBIx::Class brings with it.  So I got a clue today and decided I was just going to use copy directly, picking off the file, gunzip-ping it, and using system to execute a psql copy call.

But, when I went to edit my code, I realized that I forgot about stuff like passwords and ports and hosts and all that junk that is nice to have in a portable perl script.

CPAN docs to the rescue!
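As a rough sketch of the gunzip-and-copy idea with the connection details kept out of the command string, here is a version in Python rather than the Perl I actually use.  The table, database, user, and file names are placeholders, and this is just one way to wire it up: `psql` gets host, port, user, and database through its `-h`, `-p`, `-U`, and `-d` flags, and the password through the `PGPASSWORD` environment variable.

```python
import os
import subprocess


def build_copy_command(gz_path, table, host="localhost", port=5432,
                       user="loader", dbname="sensors"):
    """Return the two argument lists for: gunzip -c FILE | psql ... -c 'COPY ...'.

    All defaults are placeholder values for illustration.
    """
    gunzip_cmd = ["gunzip", "-c", gz_path]
    psql_cmd = [
        "psql", "-h", host, "-p", str(port), "-U", user, "-d", dbname,
        "-c", f"COPY {table} FROM STDIN WITH CSV",
    ]
    return gunzip_cmd, psql_cmd


def run_copy(gz_path, table, password=None, **conn):
    """Pipe the decompressed CSV into psql and return psql's exit code."""
    gunzip_cmd, psql_cmd = build_copy_command(gz_path, table, **conn)
    env = dict(os.environ)
    if password is not None:
        env["PGPASSWORD"] = password  # read by psql via libpq
    gunzip = subprocess.Popen(gunzip_cmd, stdout=subprocess.PIPE)
    psql = subprocess.Popen(psql_cmd, stdin=gunzip.stdout, env=env)
    gunzip.stdout.close()  # so gunzip sees a closed pipe if psql exits early
    psql.wait()
    gunzip.wait()
    return psql.returncode


gunzip_cmd, psql_cmd = build_copy_command(
    "day.csv.gz", "readings", host="db.example.org", port=5433)
print(" ".join(gunzip_cmd), "|", " ".join(psql_cmd[:-1]), repr(psql_cmd[-1]))
```

Building the argument lists separately from running them makes it easy to log exactly what will be executed, which helps with the tracking I need.  A `~/.pgpass` file would be another way to keep the password out of the script entirely.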

Not yet almost famous

I learned something pretty interesting last weekend, as we visited my in-laws and finally got our copy of Barbara’s latest book on the Civil War.  Apparently, Tavis Smiley tapes his shows in advance, and then airs the segments when they fit.  I always assumed that these radio talk shows were just non-stop live craziness, with people walking into the studio, sitting down, and then the interviewer doing the interview.  I guess if that were true, Tavis Smiley and all the rest of the NPR personalities would have superhuman stamina and vivacity.  Or else drink a lot of coffee.

Anyway, not that anybody who knows me or our family or knows Barbara or cares about freed slaves in the Civil War actually reads this blog (well, amend that to “not that anybody actually reads this blog”) but Barbara was *not* on the Tavis Smiley show last week, she was taped and will be on the show in the future when Tavis has a slot in which the interview fits.

So stay tuned sports fans.