R. Struggle with it and it becomes clear.

Been using R almost exclusively for the past few weeks. I’ve always liked R, but I find the syntax and style maddeningly slow to ingest. Perhaps everybody is like this, but I’ve found that some programming language idioms I take to pretty readily (JavaScript and Perl), some I hate (Java before generics and Spring IoC was odious; after, it is at least tolerable), and others I just have to fight through a few weeks of doing things utterly wrong.

R falls in that last camp, but since I used to be pretty good at it back when I was working on my dissertation, I’ve always considered it my go-to stats language. So now that I have a major deliverable due, and it really needs more advanced statistics than the usual “mean/max/min/sd” one can throw at data, I’ve taken the plunge back into R syntax once again.

I’m building up scripts to process massive amounts of data (massive to me, perhaps not to Google and Yahoo, but a terabyte is still a terabyte), so each step of these scripts has to be fast. Periodically I come across a step that is just too slow, or something that used to be fast but bogs down as I add more cruft and throw more data at it.

Here is an example of how R continues to confound me even after 3 weeks of R R R (I’m a pirate, watch me R). Continue reading

Musing about traffic forecasts

I wonder if there is any point to making traffic forecasts. Everybody likes weather forecasts and economic forecasts, and even global warming forecasts and peak oil forecasts. But I don’t see any traffic forecasts being made, and I’ve been thinking about why.

First off, I can’t see any direct benefit of making traffic forecasts. In the end, the information isn’t all that informative. The signal, the interesting and novel bit of information, must be something you didn’t know already, otherwise it isn’t informative. Traffic is always the same, save for the occasional incident, and the average driver sees and measures it every day. Therefore a prediction of traffic probably contains very little information to the consumer of the prediction, and so it isn’t likely that anyone will be willing to pay for traffic information.

Second, there is no benefit to the forecaster. With financial forecasts, you can make some real money. If I predict China does/doesn’t have an economic bubble and will/won’t go down the toilet, I can place bets on (oops, pardon me, Wall Street isn’t Las Vegas, so I really mean “buy stock in or sell short”) companies that will be affected by what I predict are the most likely outcomes. This is not the case with traffic. Even if I predict an accident on Interstate 5 at 8:05 AM next Tuesday, and it happens, and people plan accordingly, they’ll save a small amount of time and most likely be inconvenienced even more by adjusting their schedules and deviating from their usual routine. And the prediction isn’t likely to come true anyway; discounted accordingly, any traffic prediction is worthless. So who would pay me to make my forecasts?

It all seems pretty pointless. Unless one is stuck in traffic, wondering why no one could predict this jam and why no one is doing anything about it.

Which brings to mind the idea that people are uninterested in traffic forecasts because traffic is at once our own fault, and eminently repeatable. We condition ourselves to leave at the same times every day to get to our destinations at the appropriate time given our daily re-appraisal of prevailing traffic conditions. The only unknowns are traffic accidents, which can’t really be predicted, and unknown trips, for which the prudent allow copious amounts of time.

And that leads to my last point. What if we could predict traffic accidents? Should we do so? Suppose we could say with some confidence that every day from 8am to 8:30am on such and such a stretch of highway the relative risk of an accident is 1,000% higher than usual, perhaps due to a regular surge of traffic at that time or the way the sunlight hits drivers’ eyes, etc. Sure, the absolute risk of an accident would still be microscopically small, but over a year you might see 2 or 3 more accidents at that time and place than elsewhere. So suppose we go out on a limb, and publicly predict a higher relative risk of an accident, and then lo and behold an accident does occur. Will we the predictors be held legally liable for the accident? Will the victims’ families drag us into court and ask the judge “If they knew there was a higher risk of an accident, why didn’t they do something about it?” I’d answer that I did do something about it…I made a prediction and publicized it.
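To make the relative-versus-absolute distinction concrete, here is a back-of-the-envelope sketch. The baseline risk figure is invented purely so the arithmetic lands in the 2-or-3-extra-accidents range; nothing here is a measured value.

```javascript
// Hypothetical baseline: 1 accident per 1,000 such half-hour windows.
const baselineRisk = 1e-3;

// "1,000% higher than usual", read loosely as ten times the baseline.
const elevatedRisk = baselineRisk * 10;

// Roughly 250 weekday 8:00-8:30 windows in a year.
const weekdays = 250;

// Extra accidents per year attributable to the elevated window.
const expectedExtra = (elevatedRisk - baselineRisk) * weekdays;

console.log(expectedExtra.toFixed(2)); // "2.25"
```

So a tenfold relative risk on a microscopic base rate buys you only a couple of extra accidents a year, which is exactly why such a prediction is hard to act on.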

In the end it is probably better to just keep quiet, and tell people traffic is bad because they like to travel about all day long.

Continue reading

RJSONIO to process CouchDB output

I have an idea.  I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R via RCurl and RJSONIO (from http://www.omegahat.org/).  So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, use RJSONIO to parse the JSON, then use bootstrap methods to estimate the expected value and confidence bounds, and perhaps more importantly, try to estimate outliers and unusual events. Continue reading
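Since CouchDB speaks plain JSON over HTTP, the shape of that merge step is easy to sketch. Here it is in JavaScript; the R version with RCurl and RJSONIO follows the same steps, and the database contents below are made-up stand-ins for real view output.

```javascript
// Each monthly database answers a view query with a body of the form
// {"rows": [{"key": ..., "value": ...}, ...]}.  The HTTP fetch itself is
// elided; these strings stand in for two monthly response bodies.
const monthlyResponses = [
  '{"rows":[{"key":"2009-06-01","value":42},{"key":"2009-06-02","value":37}]}',
  '{"rows":[{"key":"2009-07-01","value":51}]}',
];

// Parse each body and concatenate the rows, pooling the physically
// separate months back into one logical data set for resampling.
const allRows = monthlyResponses
  .map((body) => JSON.parse(body).rows)
  .flat();

console.log(allRows.length); // 3 rows pooled across the monthly databases
```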

Bootstrap in a view

Inspired by this post, I am playing around with implementing bootstrap estimates of various statistics as a view in CouchDB.  I am not a statistician, so my definition should not be taken as gospel, but bootstrapping is a statistical method where one repeatedly resamples, with replacement, from an observed set of data in order to estimate some statistic, such as the mean or the median.  Most of the older sources I’ve read talk about using it for small to medium sized data sets, and so the k samples are all of size n.  But I can’t do that—my input data is too big.  So I have to pick a smaller n; I’m going with 1,000 for starters, and I’ll repeat the draw 10,000 times.

(There’s probably a secondary bootstrap I can do there to decide on the optimal size of the bootstrap sample, but I’m not going to dive into that yet.) Continue reading
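The resampling loop itself is tiny; here is a minimal sketch of the percentile-bootstrap idea described above. The post’s real parameters are n = 1,000 and k = 10,000; the demo below uses a smaller k and toy data just to keep it quick.

```javascript
// Arithmetic mean of an array.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// One bootstrap replicate: n draws with replacement from the data.
function resample(data, n) {
  const out = [];
  for (let i = 0; i < n; i++) {
    out.push(data[Math.floor(Math.random() * data.length)]);
  }
  return out;
}

// k bootstrap estimates of the mean, sorted so percentile bounds
// can be read straight off the array.
function bootstrapMeans(data, n, k) {
  const estimates = [];
  for (let j = 0; j < k; j++) {
    estimates.push(mean(resample(data, n)));
  }
  return estimates.sort((a, b) => a - b);
}

// Toy data with sample mean 8; the 2.5th and 97.5th percentiles of the
// bootstrap distribution give rough 95% confidence bounds.
const data = [3, 5, 7, 9, 11, 13];
const est = bootstrapMeans(data, 1000, 500);
const lower = est[Math.floor(0.025 * est.length)];
const upper = est[Math.floor(0.975 * est.length)];
console.log(lower, upper); // a narrow interval bracketing 8
```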

Time and space

It takes a finite amount of time to process loop data into my database, and the results take up a finite amount of space.  So no matter what, if I process and save results, it will take time and space.  We’ve ordered a faster, bigger machine, and that will help speed things up and make space less of an issue, but there are more loop detectors to process.

So the presumption is that it is actually *worth* the time and space to compute and store the data.  This isn’t necessarily the case.  In fact, what I really want access to are the long-term averages of the accident risk values over time.  Going forward, I always want to keep around a little bit of data, but the primary use case is to compare historical averages (sliced and diced in various ways) to the current values.

The problem is that it is difficult to maintain historical trends without keeping the data handy.  As I’ve said in prior postings and in my notes, I really like how CouchDB’s map/reduce approach allows the generation of different layers of statistics.  By emitting an array as the key, and a predicted risk quantity as the value, the reduce function that computes mean and variance will be run for a cascading tree of the keys.  So just by writing a map with a key like [loop_id, month, day, 15_minute_period], I can ask for averages over all data, over just a single loop, over a loop for a month, over a loop for a month for a particular Monday, and so on.
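A sketch of what such a view could look like. The document fields (doc.loop_id, doc.ts, doc.risk) are hypothetical stand-ins, and the reduce here carries only count and sum for the mean; in CouchDB, emit() is supplied by the view server, so a local one is defined to let the map run standalone.

```javascript
// Collects what the view server would index.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// Map: key is [loop_id, month, day, 15_minute_period].  Querying with
// group_level = 1, 2, 3, or 4 then reduces over a whole loop, a
// loop-month, a loop-month-day, or a single 15-minute slot.
function map(doc) {
  const d = new Date(doc.ts);
  const period = Math.floor((d.getUTCHours() * 60 + d.getUTCMinutes()) / 15);
  emit([doc.loop_id, d.getUTCMonth() + 1, d.getUTCDate(), period], doc.risk);
}

// Reduce: keep [count, sum] so the mean is recoverable at any group
// level; on rereduce, just add up the partial pairs.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    return values.reduce((a, b) => [a[0] + b[0], a[1] + b[1]], [0, 0]);
  }
  return [values.length, values.reduce((a, b) => a + b, 0)];
}

// Exercise it on two toy documents from one loop and time slot.
map({ loop_id: "715898", ts: "2009-06-01T08:05:00Z", risk: 0.25 });
map({ loop_id: "715898", ts: "2009-06-01T08:10:00Z", risk: 0.75 });
const [count, sum] = reduce(null, emitted.map((r) => r.value), false);
console.log(count, sum / count); // 2 observations, mean risk 0.5
```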

On the other hand, this is limiting.  If I change my mind and want to aggregate over days without splitting out months, or if I want to put a year field in there to evaluate annual variations, I can’t.  I have to rewrite the map, perhaps in the same view, and the whole shebang has to be recomputed—not trivial when the input set is about 15GB per week.

As CouchDB matures, perhaps it will do a faster job of computing views.  The approach certainly lends itself to parallelizing the computations, but at the moment I only see a single process thrashing through the calculations.

Finally, if I delete old data, it isn’t clear to me how I would maintain the running computations of mean and variance.  Technically it is possible—all you have to do is combine partial computations, knowing the number of observations that fed into each one.  But practically, I have a feeling that when I delete input data, the output will get blown away.
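That combining step is worth spelling out, because it is exactly what makes discarding raw data safe: two summaries of (count, mean, sum of squared deviations) merge into one exact summary. This is the standard pairwise/parallel variance update; the field names below are my own.

```javascript
// Summarize a batch of observations as (n, mean, M2), where M2 is the
// sum of squared deviations from the batch mean.
function summarize(xs) {
  const n = xs.length;
  const mean = xs.reduce((a, b) => a + b, 0) / n;
  const m2 = xs.reduce((a, b) => a + (b - mean) ** 2, 0);
  return { n, mean, m2 };
}

// Merge two summaries exactly, without touching the raw observations.
function combine(a, b) {
  const n = a.n + b.n;
  const delta = b.mean - a.mean;
  return {
    n,
    mean: a.mean + (delta * b.n) / n,
    m2: a.m2 + b.m2 + (delta * delta * a.n * b.n) / n,
  };
}

// Splitting the data and merging the summaries matches summarizing it
// all at once, so the raw batches can be deleted once summarized.
const week1 = [2, 4, 6, 8];
const week2 = [10, 12];
const merged = combine(summarize(week1), summarize(week2));
const whole = summarize(week1.concat(week2));
console.log(merged.mean, whole.mean); // both 7
```

The sample variance of any merged summary is then just m2 / (n - 1), so the running statistics survive even after the inputs are gone.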

Perhaps the best approach is to keep just a day’s worth of data in CouchDB, and run a separate PostgreSQL process to store the map/reduce output.  Then as CouchDB matures, I can eventually store longer and longer time periods, but at all times I will have a record of past history.

I think a good design would be a table with the 5-minute-rounded timestamp and loop ID as the key, and columns for all the different mean, variance, and count values for all of the different risk predictions.  This would then feed higher-level aggregation tables (day, year, and so on).  By keeping the 5-minute mean and variance, I can compute any other variance pretty quickly (average across all loops, average for that day, average for a year of that loop and 5-minute period, etc.).
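A first cut at that summary table might look something like the following; the table and column names are guesses at what the stored risk predictions would be called, not a settled schema.

```sql
-- Hypothetical 5-minute summary table; one row per loop per 5-minute bin.
CREATE TABLE loop_summary_5min (
    ts_5min    timestamp NOT NULL,  -- timestamp rounded down to 5 minutes
    loop_id    integer   NOT NULL,
    obs_count  integer   NOT NULL,  -- observations behind this row
    risk_mean  double precision,    -- mean of the risk prediction
    risk_var   double precision,    -- variance of the risk prediction
    PRIMARY KEY (ts_5min, loop_id)
);
```

Because each row carries its own count, the day- and year-level rollup tables can be rebuilt from this one at any time by combining the partial means and variances.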

A lot of data is a lot of data

I can’t seem to get an efficient setup going for storing loop data in CouchDB.  On the surface it seems pretty simple—every loop is independent of every other loop, so every observation can be a document.  But for this application that is more limiting than I first thought.  The problem is that after storing just a few days’ worth of data, the single CouchDB database expands to 35GB.  I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.

Continue reading