Time and space

It takes a non-trivial amount of time to process loop data into my database, and the results take up a non-trivial amount of space. No matter what, if I process and save results, it is going to cost time and space. We’ve ordered a faster, bigger machine, which will help with both, but there are always more loop detectors to process.

So the presumption is that it is actually *worth* the time and space to compute and store the data.  This isn’t necessarily the case.  In fact, what I really want access to are the long-term averages of the accident risk values over time.  Going forward, I always want to keep around a little bit of data, but the primary use case is to compare historical averages (sliced and diced in various ways) to the current values.

The problem is that it is difficult to maintain historical trends without keeping the data handy. As I’ve said in prior postings and in my notes, I really like how CouchDB’s map/reduce approach allows the generation of different layers of statistics. By emitting an array as the key and a predicted risk quantity as the value, a reduce function that computes mean and variance gets run over a cascading tree of the keys. So just by writing a map with a key like [loop_id, month, day, 15_minute_period], I can ask for averages over all the data, over just a single loop, over a loop for a month, over a loop for a month for a particular Monday, and so on.
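In rough terms, the view looks something like the sketch below. CouchDB views are plain JavaScript; I’ve written the sketch as TypeScript just for the type annotations, and the document fields (loop_id, timestamp, risk) are stand-ins for whatever the real schema calls them, with risk standing in for one of the several predictions.

```typescript
// Sketch only: CouchDB views are plain JavaScript; the types are for clarity.
// Field names are assumptions about the document layout, not the real schema.
declare function emit(key: unknown, value: unknown): void;

interface LoopDoc {
  loop_id: string;
  timestamp: string; // e.g. "2009-03-02T08:35:00"
  risk: number;      // one predicted accident risk value for this observation
}

// Map: key is [loop_id, month, day, 15-minute period of the day].
function map(doc: LoopDoc): void {
  const t = new Date(doc.timestamp);
  const period = Math.floor((t.getHours() * 60 + t.getMinutes()) / 15);
  emit([doc.loop_id, t.getMonth() + 1, t.getDate(), period], doc.risk);
}

// Reduce: carry count, sum, and sum of squares so mean and variance can be
// recovered at any group level.  On rereduce the inputs are previously
// reduced values, so they simply add together.
interface Stats { count: number; sum: number; sumsq: number; }

function reduce(
  _keys: unknown,
  values: Array<number | Stats>,
  rereduce: boolean
): Stats {
  const out: Stats = { count: 0, sum: 0, sumsq: 0 };
  for (const v of values) {
    if (rereduce) {
      const s = v as Stats;
      out.count += s.count;
      out.sum += s.sum;
      out.sumsq += s.sumsq;
    } else {
      const x = v as number;
      out.count += 1;
      out.sum += x;
      out.sumsq += x * x;
    }
  }
  // mean = sum / count; variance = sumsq / count - (sum / count)^2
  return out;
}
```

Querying the view with group_level=1 collapses everything to one row per loop, group_level=2 adds the month, and so on up to the full key.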

On the other hand, this is limiting. If I change my mind and want to aggregate over days without splitting out months, or if I want to put a year field in there to evaluate annual variations, I can’t: the grouping only works on prefixes of the key, in order. I have to rewrite the map, perhaps reusing the same view, and the whole shebang has to be recomputed, which is not trivial when the input set is about 15 GB per week.

As CouchDB matures, perhaps it will get faster at computing views. The map/reduce model certainly lends itself to parallelizing the computations, but at the moment I only see a single process thrashing through the calculations.

Finally, if I delete old data, it isn’t clear to me how I would maintain the running computations of mean and variance. Technically it is possible: all you have to do is combine partial computations, knowing the number of observations that fed into each one. But practically, I have a feeling that when I delete the input data, the reduced output will get blown away along with it.
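For the record, combining two partial results is just the standard parallel mean/variance update. A minimal sketch, with made-up names, where each block carries its own observation count:

```typescript
// Combine two partial results.  m2 is the sum of squared deviations from
// that block's own mean, so variance = m2 / count.
interface Partial { count: number; mean: number; m2: number; }

function combine(a: Partial, b: Partial): Partial {
  const count = a.count + b.count;
  const delta = b.mean - a.mean;
  const mean  = a.mean + (delta * b.count) / count;
  const m2    = a.m2 + b.m2 + delta * delta * (a.count * b.count) / count;
  return { count, mean, m2 };
}
```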

Perhaps the best approach is to keep just a day’s worth of data in CouchDB, and run a separate PostgreSQL database to store the map/reduce output. Then as CouchDB matures I can store longer and longer time periods in it, but at all times I have a record of past history.
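Something like the following sketch is what I have in mind for the hand-off: once a day, pull the reduced rows out of CouchDB and turn them into rows for Postgres. The database, design document, and view names are placeholders.

```typescript
// Sketch: fetch the fully grouped reduce output and convert each row into
// a record ready to be inserted into Postgres.  "loops", "stats", and
// "risk" are placeholder names.
async function pullDailyStats() {
  const res = await fetch(
    "http://localhost:5984/loops/_design/stats/_view/risk?group_level=4"
  );
  const { rows } = (await res.json()) as {
    rows: Array<{
      key: [string, number, number, number];                // [loop, month, day, period]
      value: { count: number; sum: number; sumsq: number }; // from the reduce
    }>;
  };
  return rows.map(({ key, value }) => {
    const mean = value.sum / value.count;
    return {
      loop_id: key[0],
      period: key[3],
      count: value.count,
      mean,
      variance: value.sumsq / value.count - mean * mean,
    };
  });
}
```

Each of the returned records would then become a row in the table described below.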

I think a good start would be a table keyed on the 5-minute-rounded timestamp and the loop id, with mean, variance, and count columns for each of the different risk predictions. This would then feed higher-level aggregation tables (day, year, and so on). By keeping the 5-minute means and variances along with their counts, I can compute any other mean and variance pretty quickly (the average across all loops, the average for that day, the average for a year of that loop and 5-minute period, etc.).
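As a sketch, and with a single mean/variance/count triple rather than one per risk prediction, the base table and one of the roll-ups it supports might look like this (all names are placeholders):

```typescript
// Sketch of the 5-minute base table and an example roll-up query, issued
// through node-postgres.  In the real thing there would be one
// mean/variance/count triple per risk prediction.
import { Client } from "pg";

const ddl = `
  CREATE TABLE risk_stats_5min (
    stamp_5min  timestamp NOT NULL,  -- observation time rounded to 5 minutes
    loop_id     text      NOT NULL,
    n           integer   NOT NULL,  -- observations behind this row
    mean        double precision NOT NULL,
    variance    double precision NOT NULL,
    PRIMARY KEY (stamp_5min, loop_id)
  )`;

// Long-run mean and variance for one loop, per 5-minute slot of the day,
// weighting each stored row by its observation count.
const rollup = `
  SELECT loop_id,
         stamp_5min::time                        AS slot,
         sum(n)                                  AS n,
         sum(n * mean) / sum(n)                  AS mean,
         sum(n * (variance + mean * mean)) / sum(n)
           - power(sum(n * mean) / sum(n), 2)    AS variance
  FROM   risk_stats_5min
  WHERE  loop_id = $1
  GROUP  BY loop_id, stamp_5min::time`;

async function setup(client: Client): Promise<void> {
  await client.query(ddl);
}
```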

A lot of data is a lot of data

I can’t seem to get an efficient setup going for storing loop data in CouchDB. On the surface it seems pretty simple: every loop is independent of every other loop, so every observation can be its own document. But for this application that is more problematic than I first thought. The problem is that after storing just a few days’ worth of data, the single CouchDB database expands to 35 GB. I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.
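For concreteness, each of those observation documents looks roughly like this sketch (the field names and values are made up):

```typescript
// One observation per document.  CouchDB adds _rev and its own append-only
// B-tree overhead on top of this, and every document repeats the field
// names, which is part of why the database balloons.
const exampleObservation = {
  _id: "loop-715898:2009-03-02T08:35:00",
  loop_id: "loop-715898",
  timestamp: "2009-03-02T08:35:00",
  volume: 42,      // vehicle count in the sample interval
  occupancy: 0.13, // fraction of time the loop was occupied
  risk: 0.0021,    // one of the predicted accident risk values
};
```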
