Contour Line

November 11, 2009

RJSONIO to process CouchDB output

Filed under: couchdb, research, transportation — jmarca @ 2:23 pm

I have an idea. I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R, via RCurl and RJSONIO. Even though my data is physically split into months, I can use RCurl to pull from each of the databases, use RJSONIO to parse the JSON, and then use bootstrap methods to estimate the expected value and confidence bounds and, perhaps more importantly, to identify outliers and unusual events.
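
A minimal sketch of the parse-then-bootstrap step, written in Python rather than R so it can stand alone. The sample response is made up but shaped like a CouchDB view result (a "rows" list of key/value pairs); the function name `bootstrap_mean_ci` and all of the numbers are assumptions for illustration. In a real run the JSON body would come from an HTTP GET against each monthly database.

```python
import json
import random
import statistics

# Hypothetical response body shaped like a CouchDB view result.
# The keys (5-minute timestamps) and values (counts) are invented.
SAMPLE_RESPONSE = """
{"rows": [
  {"key": "2009-01-05T08:00", "value": 12},
  {"key": "2009-01-05T08:05", "value": 15},
  {"key": "2009-01-05T08:10", "value": 11},
  {"key": "2009-01-05T08:15", "value": 14},
  {"key": "2009-01-05T08:20", "value": 13},
  {"key": "2009-01-05T08:25", "value": 40},
  {"key": "2009-01-05T08:30", "value": 12},
  {"key": "2009-01-05T08:35", "value": 16},
  {"key": "2009-01-05T08:40", "value": 13},
  {"key": "2009-01-05T08:45", "value": 14}
]}
"""


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: sample mean plus a (1 - alpha) interval for it."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = max(lo_idx, int(n_resamples * (1 - alpha / 2)) - 1)
    return statistics.fmean(values), means[lo_idx], means[hi_idx]


values = [row["value"] for row in json.loads(SAMPLE_RESPONSE)["rows"]]
mean, lo, hi = bootstrap_mean_ci(values)
print(f"mean={mean:.1f}, 95% interval=({lo:.1f}, {hi:.1f})")
```

This only covers the expected value and its bounds; flagging the odd 40 in the sample as an unusual event would need a further step.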

November 2, 2009

Tokyo Tyrant Throwing a Tantrum

Filed under: couchdb, tokyocabinet — jmarca @ 9:32 pm

Well, last Friday I posted “So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.”

It didn’t. Actually I checked later that same day and all of my jobs had died due to recv errors. I’ve tried lots of hacky things but nothing seems to do the trick. From some Google searching, it seems that perhaps it is a timeout issue, but I can’t see how to modify the Perl library to allow for a longer timeout.

So, I wrote a little hackity hack thing to stop writing for 5 seconds, make a new connection, and go on writing. Now it only crashes out of the loop if that new connection also fails to write. And I don’t crash until I have saved my place in the CSV file, so I don’t repeat myself. So I’m not getting a complete failure, but it is still super slow.
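
The pause, reconnect, retry, and save-my-place loop can be sketched roughly like this. It is a Python illustration, not the Perl script itself; `FlakyStore`, `load_rows`, and `save_checkpoint` are hypothetical names, and the stand-in store simply raises an error on chosen calls to imitate a dropped connection.

```python
import time


class FlakyStore:
    """Stand-in for a remote store whose connection sometimes drops (hypothetical)."""

    def __init__(self, fail_on=()):
        self.fail_on = set(fail_on)  # call numbers that should fail
        self.calls = 0
        self.data = {}

    def put(self, key, value):
        self.calls += 1
        if self.calls in self.fail_on:
            raise ConnectionError("recv error")
        self.data[key] = value


def load_rows(rows, connect, start=0, pause=5.0, save_checkpoint=lambda i: None):
    """Write rows[start:], retrying each failed write once on a fresh connection."""
    conn = connect()
    for i in range(start, len(rows)):
        key, value = rows[i]
        try:
            conn.put(key, value)
        except ConnectionError:
            time.sleep(pause)   # stop writing for a bit
            conn = connect()    # make a new connection
            try:
                conn.put(key, value)
            except ConnectionError:
                save_checkpoint(i)  # remember the row to resume from
                raise
    save_checkpoint(len(rows))  # everything was written


store = FlakyStore(fail_on={2})  # the second write fails once
checkpoints = []
load_rows([("a", 1), ("b", 2), ("c", 3)], connect=lambda: store,
          pause=0, save_checkpoint=checkpoints.append)
print(store.data, checkpoints)
```

Passing the saved index back in as `start` on the next run picks up where the failed run stopped, so no row is written twice.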

While the documentation for Tokyo Tyrant and Tokyo Cabinet is super great in general, it seems thin on use cases and examples for stuffing a lot of data into the table db at once.

An interesting, probably unrelated fact: the crashing only started when I recomputed my target bnum and boosted it from 8 million to 480 million.

Anyway, I had time today to tweak the data load script, and also to finalize my CouchDB loading script. Having started two jobs each, with Tokyo Tyrant started first, it looks like CouchDB is going to finish first. (The January CouchDB job is completing three days for every one in the Tokyo Tyrant job; the March jobs are closer together, but that Tyrant job started about an hour before everything else.)

I guess there is still a way for Tokyo Tyrant to win this race.  I am planning to set up a map/reduce type of view on my CouchDB datastore to collect hourly summaries of the data.  It might be that computing that view is slow, and that computing similar summaries on the Tokyo Cabinet table is faster.  We’ll see.
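
To make the hourly-summary idea concrete, here is a small Python imitation of a map/reduce view. Real CouchDB views are JavaScript functions stored in a design document, so this only mirrors the shape of the computation; the document fields (`detector`, `ts`, `volume`) and the function names are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def map_doc(doc):
    """Emit one (key, value) pair per document, keyed by detector and hour."""
    ts = datetime.fromisoformat(doc["ts"])
    yield (doc["detector"], ts.year, ts.month, ts.day, ts.hour), doc["volume"]


def reduce_values(values):
    """Collapse all values that share a key into a sum and a count."""
    return {"sum": sum(values), "count": len(values)}


def hourly_summary(docs):
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(vals) for key, vals in sorted(grouped.items())}


# Hypothetical 5-minute records
docs = [
    {"detector": "d1", "ts": "2009-01-05T08:00:00", "volume": 12},
    {"detector": "d1", "ts": "2009-01-05T08:05:00", "volume": 15},
    {"detector": "d1", "ts": "2009-01-05T09:00:00", "volume": 9},
]
for key, summary in hourly_summary(docs).items():
    print(key, summary)
```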

October 30, 2009

Tokyo Tyrant is cool

Filed under: couchdb, tokyocabinet — jmarca @ 10:30 pm

Just to have a recollection of this later, some notes.

Setting up Tokyo Tyrant instances, one per month. I expect about 4 million records a day, so that is 120 million a month, so I set bnum to 480 million (4 times the monthly count), which seems insane, but worth a shot.

One thing I noticed: in shifting from one-day tests to a one-month populate, with bnum bumped up from 8 million (2 times 4 million) to 480 million, there is a significant speed drop when populating the data from four simultaneous processes (one for each of 4 months).
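
The bnum arithmetic, spelled out as a tiny Python sketch. The 30-day month and the helper name `bnum_for` are assumptions; the factors of 2 and 4 are just the ratios implied by the numbers above.

```python
RECORDS_PER_DAY = 4_000_000
DAYS_PER_MONTH = 30  # assumed round number


def bnum_for(expected_records, factor):
    """Bucket count as a simple multiple of the expected record count."""
    return expected_records * factor


one_day_bnum = bnum_for(RECORDS_PER_DAY, 2)
one_month_records = RECORDS_PER_DAY * DAYS_PER_MONTH
one_month_bnum = bnum_for(one_month_records, 4)
print(one_day_bnum, one_month_records, one_month_bnum)
# prints: 8000000 120000000 480000000
```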

There is a write delay, of course, and that may be all of it, since the files are big now.

Perhaps there is a benefit from wider tables, rather than one row per data record?  Like one row per hour of data per sensor, or one row per 5 minutes, etc?

Also, as I wrapped up my initial one-day tests, I got some random crashes in my Perl script stuffing data in. Not sure why. Could be because I was tweaking parameters and stuff.

One final point: the size of one day of data in Tokyo Cabinet is about the same as the size of one day of data in CouchDB. I was hoping to get a much bigger size advantage (smaller file). The source data is about 100 MB as an unzipped CSV file, and it balloons to 600 MB with bnum set at 8 million in a table database. Of course, it isn’t strictly the same data… I am splitting the timestamp into parts so I can do more interesting queries without a lot of work (give me an average of data on Mondays in July; Tuesdays all year; 8 am to 9 am last Wednesday, etc.).
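
A rough Python sketch of what splitting the timestamp into parts buys. The column names, the timestamp format, and the sample values are assumptions, not the actual table layout.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def split_timestamp(ts):
    """Break 'YYYY-MM-DD HH:MM:SS' into separately queryable columns."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "year": t.year,
        "month": t.month,
        "day": t.day,
        "dow": DAY_NAMES[t.weekday()],  # weekday() is 0 for Monday
        "hour": t.hour,
        "minute": t.minute,
    }


# Hypothetical records: (timestamp, volume)
raw = [
    ("2009-07-06 08:00:00", 10),  # a Monday in July
    ("2009-07-13 08:05:00", 20),  # another Monday in July
    ("2009-07-14 08:00:00", 99),  # a Tuesday
]
rows = [{**split_timestamp(ts), "volume": v} for ts, v in raw]

# "Give me an average of data on Mondays in July"
mondays = [r["volume"] for r in rows if r["month"] == 7 and r["dow"] == "Monday"]
print(sum(mondays) / len(mondays))  # 15.0
```

With the parts stored as their own columns, each of those example queries becomes a plain equality filter instead of date arithmetic at query time, at the cost of a fatter record.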

So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.

And by the way, I’m sure I’m not the best at this because I haven’t used it much, but it is orders of magnitude faster to use the COPY command via DBIx::Class to load CSV data into PostgreSQL.  Of course, I don’t want to have all of that data sitting in my relational database, but I’m just saying…

October 26, 2009

Putting stuff away

Filed under: couchdb, tokyocabinet — jmarca @ 8:34 am

Started testing out TokyoCabinet and TokyoTyrant last Friday, and got my initial test program running this morning. The documentation is pretty good, but I’m still floundering about a little bit. Not sure what parameters to pass to the B+ tree database file to make it work well for my data; not sure how to set up multiple databases for sharding; etc. On the plus side, my Perl code that loads the data is running at about 50% CPU, so it is doing something rather than waiting around for writes. On the down side, now I have to write a small program to check on the progress of those writes to make sure that I am actually writing something!

Update. I am comparing storing in TokyoTyrant with storing in CouchDB. CouchDB, it turns out, is faster for me out of the box because of the way Erlang takes advantage of the multi-core processor. The Tokyo Tyrant server just maxes out one core, and so my loading programs wait around for the server to process the data. CouchDB, on the other hand, will use up lots more cores (I’ve seen the process go to about 400% in top). So, loading a year of data with one data-reading process per month running simultaneously, TokyoTyrant is only up to day 6 of each month, while my CouchDB loader programs are all up to about day 14 in each month.

I’m sure there is a way to set up TokyoTyrant to use multiple CPUs, but I can’t find it yet.

October 23, 2009

Related to someone getting more and more almost famous by the day

Filed under: civil war history — jmarca @ 2:14 pm

Well, my mother-in-law’s interview with Tavis Smiley still hasn’t been broadcast (perhaps they are saving it for February?), but she got a very good review from the Washington Times dated Oct 8, 2009. Of course, the internet being the internet, it has totally fallen off the front page of the book review section and even the Military History section, but lives on in the hard disk cache in the sky. If you google “Escaped Slaves and the Union Navy” you get right to the review page by Gordon Berg.

It is interesting to me that it takes a third book to start getting positive buzz that goes beyond friends and acquaintances. While the topic helps a little bit, in that with Obama in the White House people are taking a fresh look at black history in our nation, I don’t think that is all of it. Her book on “G.I. Nightingales” was also pretty good, and should have been just as popular, but didn’t get the buzz. Nor is it just that after three books one’s writing is bound to improve. Perhaps it is just that with three books reviewers are more likely to review a book, and the publisher is more likely to get traction marketing the book.

Maybe the next book will be optioned by Hollywood, then we’ll really be related to somebody famous!

Or maybe I will write four books on transportation engineering and get a movie made.

Or maybe one of the girls will finally write the book with the title “The Moon is the Nightime Sun” that they’ve been on about since they were 5…

October 13, 2009

Using psql copy from DBIx::Class

Filed under: code — jmarca @ 4:04 pm

I am loading up lots and lots of data, and need to track what is going on, but I really don’t need all of the stuff that DBIx::Class brings with it.  So I got a clue today and decided I was just going to use copy directly, picking off the file, gunzip-ping it, and using system to execute a psql copy call.

But, when I went to edit my code, I realized that I forgot about stuff like passwords and ports and hosts and all that junk that is nice to have in a portable perl script.

CPAN docs to the rescue!  (more…)

October 9, 2009

Not yet almost famous

Filed under: civil war history — jmarca @ 8:42 am

I learned something pretty interesting last weekend, as we visited my in-laws and finally got our copy of Barbara’s latest book on the Civil War. Apparently, Tavis Smiley tapes his shows in advance, and then airs the segments when they fit. I always assumed that these radio talk shows were just non-stop live craziness, with people walking into the studio, sitting down, and then the interviewer doing the interview. I guess if that were true, Tavis Smiley and all the rest of the NPR personalities would have superhuman stamina and vivacity. Or else drink a lot of coffee.

Anyway, not that anybody who knows me or our family or knows Barbara or cares about freed slaves in the Civil War actually reads this blog (well, amend that to “not that anybody actually reads this blog”) but Barbara was *not* on the Tavis Smiley show last week, she was taped and will be on the show in the future when Tavis has a slot in which the interview fits.

So stay tuned sports fans.

September 30, 2009

Swinger is cool, Sammy looks cooler

Filed under: couchdb, sakai — jmarca @ 10:02 pm

Just tried out Swinger. It is cool. But I can’t get authorization to work right using the trunk checkout of CouchDB (0.11.blahblah_git). Something to hack on.

But I’m more interested in playing with Sammy.js. The two application stack figures on the blog page (and in the Swinger slides) are interesting. Take away the CouchDB bit, add Sakai’s K2, and you’ve got a very similar picture. Sure, CouchDB can serve the app with attachments to the _design doc, but that’s not the point. The point is being able to stick documents into a db and then get them out again in interesting ways without having to bend over backwards on the server side.

But again, I have to play with it for a while and see what it can do.

September 29, 2009

Barbara Tomblin is getting interviewed on Tavis Smiley show!

Filed under: civil war history — jmarca @ 9:49 am

Crazy news. After toiling away for a few years researching a book on escaped slaves and their role in the Union Navy’s blockade of the South in the Civil War, and after push push pushing to get it published, my mother-in-law, Barbara Tomblin, finally got her third (I think) book published. Then, before she’s even given her daughter a copy to put on our book shelf next to G.I. Nightingales and With Utmost Spirit, we got a call yesterday that she’s going to be on the radio on the Tavis Smiley show today. I guess the interview is getting taped today, will probably get picked up on the radio show later today, and then will be podcast at some point on the website.

Anyway. I’m posting this here so maybe Google will pick up a link and anyone searching for “the role played by escaped slaves in the Union blockade along the Atlantic coast” will have a better chance of finding her interview.

September 1, 2009

Back from vacation

Filed under: Uncategorized — jmarca @ 8:14 am

Twitter’s insidious influence on my brain has me jotting things down in short phrases.  Postcards to myself.

Back from Hawaii.

Got some knitting done on Emma’s sweater (sleeve 1 is done, sleeve 2 is 80%).

Got some sun.

Got some food poisoning.

Got a clue that I definitely hurt my hip over Easter by swimming.

Got behind on my work.
