Time and space

It takes a finite amount of time to process loop data into my database, and the results take up a finite amount of space.  So no matter what, if I process and save results, it will take time and space.  We’ve ordered a faster, bigger machine, and that will help speed things up and make space less of an issue, but there are more loop detectors to process.

So the presumption is that it is actually *worth* the time and space to compute and store the data.  This isn’t necessarily the case.  In fact, what I really want access to are the long-term averages of the accident risk values over time.  Going forward, I always want to keep around a little bit of data, but the primary use case is to compare historical averages (sliced and diced in various ways) to the current values.

The problem is that it is difficult to maintain historical trends without keeping the data handy.  As I’ve said in prior postings and in my notes, I really like how CouchDB’s map reduce approach allows the generation of different layers of statistics.  By emitting an array as the key, and a predicted risk quantity as the value, the reduce function that computes mean and variance can be run at every level of a cascading tree of the keys.  So just by writing a map with a key like [loop_id, month, day, 15_minute_period], I can ask for averages over all data, over just a single loop, over a loop for a month, over a loop for a month for a particular Monday, and so on.
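To make the cascading idea concrete, here is a rough sketch in plain Python of what grouping at different key depths amounts to. It only imitates the effect of asking a reduce view for different group levels; it is not CouchDB code, and the function name `reduce_at_level`, the loop ids, and the risk values are all made up for illustration.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def reduce_at_level(rows, group_level):
    """Group (key, value) rows by the first `group_level` key elements.

    Returns {key_prefix: (count, mean, population_variance)}.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[tuple(key[:group_level])].append(value)
    return {
        prefix: (len(vals), fmean(vals), pvariance(vals))
        for prefix, vals in groups.items()
    }


# Hypothetical emitted rows: key = [loop_id, month, day, 15_minute_period]
rows = [
    (["loop_a", 1, 5, 32], 0.10),
    (["loop_a", 1, 5, 33], 0.30),
    (["loop_a", 2, 9, 32], 0.20),
    (["loop_b", 1, 5, 32], 0.40),
]

overall = reduce_at_level(rows, 0)         # a single row covering all data
per_loop = reduce_at_level(rows, 1)        # one row per loop
per_loop_month = reduce_at_level(rows, 2)  # one row per loop and month
```

The order of the key elements fixes which roll-ups are available: each level can only drop fields from the right-hand end of the key.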

On the other hand, this is limiting.  If I change my mind and want to aggregate over days but without splitting out months, or if I want to put a year field in there to evaluate annual variations, I can’t.  I have to rewrite the map, perhaps using the same view, and the whole shebang has to be recomputed—not trivial when the input set is about 15G per week.

As CouchDB matures, perhaps it will do a faster job computing views.  The approach is certainly there to parallelize the computations, but at the moment I only see a single process thrashing through the calculations.

Finally, if I delete old data, it isn’t clear to me how I would still maintain the running computations of mean and variance.  Technically it is possible: all you have to do is combine partial computations, knowing the number of observations that fed into each one.  But practically, I have a feeling that when I delete input data, the output will get blown away.
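For the record, combining partial computations can be sketched as follows. This is a minimal illustration, assuming each partial result is kept as (count, mean, sum of squared deviations from the mean); it uses the standard pairwise update for merging two such summaries, and the names `summarize` and `merge_stats` are hypothetical.

```python
def summarize(values):
    """Return (count, mean, m2), where m2 is the sum of squared deviations."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)


def merge_stats(a, b):
    """Combine two (count, mean, m2) summaries without the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


# Made-up values: a summary of data that has since been deleted, plus new data.
old = summarize([0.1, 0.3, 0.2])
new = summarize([0.4, 0.6])
combined = merge_stats(old, new)
variance = combined[2] / combined[0]  # population variance of all five values
```

Only the three numbers per summary need to survive the deletion of the inputs, which is why the counts matter as much as the means and variances.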

Perhaps the best approach is to maintain CouchDB for just a day’s worth of data, and run a separate PostgreSQL process to store the map reduce output.  Then as CouchDB matures, I can eventually store longer and longer time periods, but at all times I have a record of past history.

I think a table keyed on the 5-minute-rounded timestamp and the loop id, storing the mean, variance, and count for each of the different risk predictions, would be good.  This would then feed higher level aggregation tables (like day, year, and so on).  By keeping the 5 minute mean, variance, and count, I can compute any other mean or variance pretty quickly (average across all loops, average for that day, average for a year of that loop and 5 minute period, etc).
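A rough sketch of such a table and one roll-up follows. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs anywhere; the table name `risk_5min`, the column names, and the sample rows are all invented for illustration, only one risk prediction is shown, and the variance is assumed to be the population variance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE risk_5min (
        ts       TEXT    NOT NULL,  -- timestamp rounded to 5 minutes
        loop_id  TEXT    NOT NULL,
        n        INTEGER NOT NULL,  -- observations in this 5 minute bin
        mean     REAL    NOT NULL,
        variance REAL    NOT NULL,  -- population variance within the bin
        PRIMARY KEY (ts, loop_id)
    )
""")
conn.executemany(
    "INSERT INTO risk_5min VALUES (?, ?, ?, ?, ?)",
    [
        ("2010-01-05 08:00", "loop_a", 10, 0.20, 0.010),
        ("2010-01-05 08:05", "loop_a", 30, 0.40, 0.020),
        ("2010-01-05 08:00", "loop_b", 20, 0.10, 0.005),
    ],
)

# Roll 5 minute bins up to one row per loop per day.
# Pooled mean     = sum(n * mean) / sum(n)
# Pooled variance = sum(n * (variance + mean^2)) / sum(n) - pooled_mean^2
daily = conn.execute("""
    SELECT loop_id,
           substr(ts, 1, 10)      AS day,
           SUM(n)                 AS day_n,
           SUM(n * mean) / SUM(n) AS day_mean,
           SUM(n * (variance + mean * mean)) / SUM(n)
             - (SUM(n * mean) / SUM(n)) * (SUM(n * mean) / SUM(n))
                                  AS day_variance
    FROM risk_5min
    GROUP BY loop_id, substr(ts, 1, 10)
    ORDER BY loop_id
""").fetchall()
```

Changing the GROUP BY clause gives the other roll-ups (all loops, a year of one loop, and so on) without touching the raw data.  One caveat: this form of the pooled variance subtracts two similar numbers, so it can lose precision when the variance is tiny compared to the squared mean.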

One thought on “Time and space”

  1. You could periodically query the reduce view with various group-levels and store the rows as documents in another db; then you can do quick queries on the reduced data. Also, you can throw out the source data and just keep the reductions if space is an issue.
