A lot of data is a lot of data

I can’t seem to get an efficient setup going for storing loop data in CouchDB.  On the surface it seems pretty simple: every loop is independent of every other loop, so every observation can be a document.  But for this application that turns out to be more limiting than I first thought.  The problem is that after storing just a few days’ worth of data, the single CouchDB database expands to 35GB.  I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.
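For reference, a mean/variance view of this kind can be sketched roughly as follows. This is not the actual riskstats2 view; the document fields `loop_id` and `volume` and the helper `meanAndVariance` are assumptions made up for illustration. The reduce keeps only a count, a sum, and a sum of squares, so partial results can be merged again in the rereduce step.

```javascript
// Assumed document shape (hypothetical): { loop_id: "1204650", volume: 12 }
// In CouchDB, emit() is a global supplied by the view server; here it is
// passed in as an argument so the sketch also runs standalone in Node.
function mapObservation(doc, emit) {
  if (typeof doc.volume === "number") {
    emit(doc.loop_id, doc.volume);
  }
}

// Reduce to a small, mergeable summary: {count, sum, sumsqr}.
// On rereduce, the values are summaries from earlier reduce calls.
function reduceStats(keys, values, rereduce) {
  var acc = { count: 0, sum: 0, sumsqr: 0 };
  for (var i = 0; i < values.length; i++) {
    var v = values[i];
    if (rereduce) {
      acc.count += v.count;
      acc.sum += v.sum;
      acc.sumsqr += v.sumsqr;
    } else {
      acc.count += 1;
      acc.sum += v;
      acc.sumsqr += v * v;
    }
  }
  return acc;
}

// Client-side step: turn a summary into mean and population variance.
function meanAndVariance(s) {
  var mean = s.sum / s.count;
  return { mean: mean, variance: s.sumsqr / s.count - mean * mean };
}

// Usage: reduce two chunks separately, then merge them with rereduce.
var values = [];
[2, 4, 4, 4, 5, 5, 7, 9].forEach(function (x) {
  mapObservation({ loop_id: "demo", volume: x }, function (key, value) {
    values.push(value);
  });
});
var partA = reduceStats(null, values.slice(0, 3), false);
var partB = reduceStats(null, values.slice(3), false);
console.log(meanAndVariance(reduceStats(null, [partA, partB], true)));
// { mean: 5, variance: 4 }
```

The sum-of-squares formula can lose precision when the mean is large relative to the spread, so treat it as a starting point. Later CouchDB releases also ship a built-in `_stats` reducer that returns this kind of summary without going through the JavaScript view server.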

So one db doesn’t work so well.  There was a posting to the user mailing list asking how many databases people were using.  So I gave that a shot, trying one db per loop, with the same map/reduce.  The downside is that I can’t compute averages across loops as before, but that’s no real loss, since with a single db I couldn’t get the view generation to finish at all.

But I still have problems.  The main CouchDB process (beam in top) runs up to 80 or 90 percent CPU usage, and there are lots of JavaScript child processes split off, but the view computation is still super slow, even on just a few days of data (14 GB uncompressed, including view cache).

I’m thinking that the only way to avoid this problem is to keep updating the view with every insert into the database.  But I’m worried that will fall behind real time, let alone allow me to move backwards and process last year’s data too.  I have no hard measurements, but in my experience once the cached view falls about a gigabyte behind the data, it never catches up: data loading goes too fast, and index processing never finishes.  Or I get mystery errors like:

at /usr/lib/perl5/site_perl/5.8.8/i486-linux-thread-multi/Coro.pm line 419
    Reason => "Connection timed out",
    Status => 599,
    URL => "http://localhost:5984/safetydb1204650/_view/riskstats2/All",
] at ping_couchdbs.pl line 61

from my program that pings the views for all of the databases.
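The real script is Perl (ping_couchdbs.pl) and is not shown here. As a rough JavaScript sketch of the same idea, under stated assumptions: the URL layout is copied from the error above, while `viewUrl`, `withTimeout`, `pingViews`, and the option names are made up for illustration. `fetchFn` is assumed to behave like the global `fetch` in recent Node versions, returning a promise for an object with a `status` field. Each view is pinged in turn with a client-side timeout, so one stuck index gets recorded instead of stalling the whole run.

```javascript
// Build a view URL in the same layout as the one in the error message.
function viewUrl(base, db, design, view) {
  return base + "/" + db + "/_view/" + design + "/" + view;
}

// Reject if the promise has not settled within ms milliseconds.
function withTimeout(promise, ms) {
  var timer;
  var timeout = new Promise(function (resolve, reject) {
    timer = setTimeout(function () {
      reject(new Error("timed out after " + ms + " ms"));
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(function () {
    clearTimeout(timer);
  });
}

// Ping one view per database, sequentially, and collect the outcomes.
// opts: { base, design, view, timeoutMs } (hypothetical option names).
async function pingViews(dbNames, fetchFn, opts) {
  var results = [];
  for (var i = 0; i < dbNames.length; i++) {
    var db = dbNames[i];
    var url = viewUrl(opts.base, db, opts.design, opts.view);
    try {
      var res = await withTimeout(fetchFn(url), opts.timeoutMs);
      results.push({ db: db, ok: res.status === 200, status: res.status });
    } catch (err) {
      // A timeout or connection error is recorded, not rethrown.
      results.push({ db: db, ok: false, error: err.message });
    }
  }
  return results;
}
```

Pinging sequentially is deliberate in this sketch: firing every request at once would ask the server to build every index at the same time, which is the load problem described above. Note that a client-side timeout only stops the waiting; it does not cancel whatever the server is doing.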

So maybe for now I need to go back to PostgreSQL.  I do like the map/reduce part of CouchDB, and I do like the unstructured doc format, but perhaps it isn’t so good for massive number crunching yet.  Not being able to get a year of data in and a valid annual average out is kind of a show stopper.

But to be fair, I couldn’t do that in PostgreSQL either.  My use of CouchDB may help there, as I now have a document-centric view of the data, rather than a relational view.  And before I get off the couch (sorry), I still need to look into alternate view servers.  Maybe I can make a view server in C that can run faster than the JavaScript map/reduce process.  But that will probably have to wait until next weekend while I finish other projects.

