RJSONIO to process CouchDB output

I have an idea. I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R via RCurl and RJSONIO (from http://www.omegahat.org/). So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, use RJSONIO to parse the JSON, and then use bootstrap methods to estimate the expected value and confidence bounds and, perhaps more importantly, to try to estimate outliers and unusual events.
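A sketch of what that pull-and-parse step might look like, assuming CouchDB is running on localhost and using made-up database, design document, and view names:

```r
library(RCurl)    # HTTP access to CouchDB's REST interface
library(RJSONIO)  # JSON parsing

## hypothetical monthly database names and view path
months <- c("detectors_2009_01", "detectors_2009_02", "detectors_2009_03")

pull_month <- function(db) {
  url  <- paste("http://localhost:5984/", db,
                "/_design/agg/_view/five_minute?reduce=false", sep = "")
  json <- getURL(url)   # fetch the raw view output
  fromJSON(json)$rows   # list of {id, key, value} rows
}

## one list of rows per monthly database, ready for resampling
rows <- lapply(months, pull_month)
```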

Bootstrap in a view

Inspired by this post, I am playing around with implementing bootstrap estimates of various statistics as a view in CouchDB. I am not a statistician, so my definition should not be taken as gospel, but bootstrapping is a statistical method in which one repeatedly samples, with replacement, from an observed set of data in order to estimate some statistic, such as the mean or the median. Most of the older sources I’ve read talk about using it for small to medium sized data sets, and so the k bootstrap samples are all of size n, the size of the original data set. But I can’t do that, because my input data is too big, so I have to pick a smaller n. I’m going with 1,000 for starters, and will repeat the draw 10,000 times.
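The implementation I’m playing with lives in a CouchDB view, but purely as an illustration of the resampling scheme itself, here is a sketch in R (with x standing in for a vector of observed 5-minute values):

```r
## x: stand-in for a vector of observed 5-minute values
x <- rnorm(1e5, mean = 40, sd = 10)

## draw 10,000 bootstrap samples of size 1,000, with replacement,
## keeping the mean of each draw
boot_means <- replicate(10000, mean(sample(x, size = 1000, replace = TRUE)))

## point estimate and rough 95% bounds from the bootstrap distribution
mean(boot_means)
quantile(boot_means, probs = c(0.025, 0.975))
```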

(There’s probably a secondary bootstrap I can do there to decide on the optimal size of the bootstrap sample, but I’m not going to dive into that yet.)

Time and space

It takes a finite amount of time to process loop data into my database, and the results take up a finite amount of space.  So no matter what, if I process and save results, it will take time and space.  We’ve ordered a faster, bigger machine, and that will help speed things up and make space less of an issue, but there are more loop detectors to process.

So the presumption is that it is actually *worth* the time and space to compute and store the data.  This isn’t necessarily the case.  In fact, what I really want access to are the long-term averages of the accident risk values over time.  Going forward, I always want to keep around a little bit of data, but the primary use case is to compare historical averages (sliced and diced in various ways) to the current values.

The problem is that it is difficult to maintain historical trends without keeping the data handy. As I’ve said in prior postings and in my notes, I really like how CouchDB’s map/reduce approach allows the generation of different layers of statistics. By emitting an array as the key and a predicted risk quantity as the value, the reduce function that computes mean and variance will be run over a cascading tree of the keys. So just by writing a map with a key like [loop_id, month, day, 15_minute_period], I can ask for averages over all data, over just a single loop, over a loop for a month, over a loop for a month for a particular Monday, and so on.
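For example (a sketch with RCurl and RJSONIO again, using made-up database and view names), the same view can be queried at different group levels to get those different slices:

```r
library(RCurl)
library(RJSONIO)

## hypothetical view whose map emits [loop_id, month, day, period]
## as the key and whose reduce returns mean, variance, and count
base <- "http://localhost:5984/detectors_2009_01/_design/agg/_view/risk"

## group_level=1 gives one row per loop, averaged over everything;
## group_level=2 gives one row per loop per month; and so on down the key
per_loop       <- fromJSON(getURL(paste(base, "?group_level=1", sep = "")))$rows
per_loop_month <- fromJSON(getURL(paste(base, "?group_level=2", sep = "")))$rows
```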

On the other hand, this is limiting. If I change my mind and want to aggregate over days without splitting out months, or if I want to put a year field in there to evaluate annual variations, I can’t. I have to rewrite the map, perhaps using the same view, and the whole shebang has to be recomputed, which is not trivial when the input set is about 15 GB per week.

As CouchDB matures, perhaps it will do a faster job of computing views. The map/reduce approach certainly lends itself to parallelizing the computations, but at the moment I only see a single process thrashing through the calculations.

Finally, if I delete old data, it isn’t clear to me how I would still maintain the running computations of mean and variance. Technically it is possible: all you have to do is combine partial computations, knowing the number of observations that fed into each one. But practically, I have a feeling that when I delete input data, the output will get blown away.
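For the record, the combination step is simple enough. A sketch in R, assuming each partial result carries its count, mean, and (population) variance:

```r
## merge two partial summaries (count, mean, variance) into one,
## using the standard pooled-mean / pooled-variance identities
combine_stats <- function(a, b) {
  n     <- a$n + b$n
  delta <- b$mean - a$mean
  ## recover sums of squared deviations from the variances, add the
  ## between-part correction term, then divide by the combined count
  m2    <- a$var * a$n + b$var * b$n + delta^2 * a$n * b$n / n
  list(n = n, mean = a$mean + delta * b$n / n, var = m2 / n)
}

## e.g. fold three daily summaries (288 five-minute periods each)
## into a single three-day summary
Reduce(combine_stats,
       list(list(n = 288, mean = 41.2, var = 90.1),
            list(n = 288, mean = 39.8, var = 85.5),
            list(n = 288, mean = 44.0, var = 93.7)))
```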

Perhaps the best approach is to keep just a day’s worth of data in CouchDB, and run a separate PostgreSQL process to store the map/reduce output. Then as CouchDB matures, I can eventually store longer and longer time periods, but at all times I will have a record of past history.

I think a table keyed on the 5-minute-rounded timestamp and loop id, with the mean, variance, and count values for each of the different risk predictions, would be good. This would then feed higher-level aggregation tables (day, year, and so on). By keeping the 5-minute mean and variance, I can compute any other variance pretty quickly (the average across all loops, the average for that day, the average for a year of that loop and 5-minute period, etc.).
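As a sketch of that roll-up (hypothetical column names, reusing the combine_stats() helper sketched above), turning the 5-minute rows into per-loop daily summaries might look like:

```r
## five_min: data frame with columns loop_id, day, n, mean, var,
## one row per loop per 5-minute period (hypothetical layout)
rollup <- function(rows) {
  parts <- Map(function(n, m, v) list(n = n, mean = m, var = v),
               rows$n, rows$mean, rows$var)
  Reduce(combine_stats, parts)
}

## per-loop, per-day summaries built from the 5-minute table
daily <- lapply(split(five_min, list(five_min$loop_id, five_min$day),
                      drop = TRUE),
                rollup)
```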

A lot of data is a lot of data

I can’t seem to get an efficient setup going for storing loop data in CouchDB. On the surface it seems pretty simple: every loop is independent of every other loop, so every observation can be a document. But for this application that is more limiting than I first thought. The problem is that after storing just a few days’ worth of data, the single CouchDB database expands to 35 GB. I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.

Trevor’s Autonet paper published

Trevor’s Autonet paper finally got published, and we’ve gotten a small bit of press.  Funny how that works.  Do research and build a prototype.  Write a paper or two or four, apparently get no interest.  Project mostly trickles off.  Then one paper finally gets published by a slower journal, and hey, everybody is interested.

While the ideas are good, and while Trevor and his team did a great job with the prototype and got a working system running, I think the real barrier to something like Autonet taking off is the difficulty of getting a local area wireless connection up and running. Not from a technical, bits/bytes/hand-off/Doppler-shift point of view. Rather from a non-technical user’s point of view. It is quite difficult to set up a device so that it both blabs and listens on some open wireless channel without requiring careful attention from the user. Most wifi links, in contrast, are pretty simple to use because there is a defined server and client. But even then most dialogs ask the user to select which host to access, and some require some sort of password or access code.

In the intervening years between working on that stuff and where we are now, we’ve sort of come to the conclusion that the data channel isn’t as important as just freeing the information from the automobile.  From the person traveling, really.

The primary advantage of a local area wireless connection is that, well, those cars and devices you can talk to probably have data that are relevant to you too, because you’re all sitting in the same spot. The local area wireless link acts like a spatial query on the huge mountain of traffic data that is available. The disadvantage is the need to configure your wireless device in a secure, user-friendly way, and the need to develop some sort of protocol to query distant locations.

On the other hand, a cellular link does not give you an automatic spatial query on the data. Of course you can *do* a spatial query, but that costs some CPU cycles, whereas with the Autonet idea you’re *only* querying geographically proximate neighbors. You’ve also got the problem that wide-area wireless links cost money to use. Cellphone companies are known to charge outrageous rates for data transfer, and in fact, AT&T specifically forbids using their data connection in the manner in which we would *like* to use it. To quote from their service agreement terms and conditions:

Prohibited and Permissible Uses: Except as may otherwise be specifically permitted or prohibited for select data plans, data sessions may be conducted only for the following purposes: (i) Internet browsing; (ii) email; and (iii) intranet access. …[T]here are certain uses that cause extreme network capacity issues and interference with the network and are therefore prohibited. Examples of prohibited uses include, without limitation, the following: (i) server devices or host computer applications, including, but not limited to, Web camera posts or broadcasts, automatic data feeds, automated machine-to-machine connections or peer-to-peer (P2P) file sharing; …

So, an app that automatically uploads location and speed and queries traffic conditions every few seconds is out, but an application that “browses the internet” is okay.   So an application that responds to user input to “browse” the internet with a heartbeat ping is probably okay, but making it a daemon that bleeps every few minutes is not.

Gotta get us some iPhones so we can test this stuff out, I guess.  Which means we have to get funding.