So I finally spent an hour playing with node.js.
I wonder if there is any point to making traffic forecasts. Everybody likes weather forecasts and economic forecasts, and even global warming forecasts and peak oil forecasts. But I don’t see any traffic forecasts being made, and I’ve been thinking about why.
First off, I can’t see any direct benefit of making traffic forecasts, because the forecast carries very little new information. The signal, the interesting and novel bit, has to be something you didn’t already know, or it isn’t informative at all. Traffic is always the same, save for the occasional incident, and the average driver sees and measures it every day. A traffic prediction therefore contains very little information for its consumer, and so it isn’t likely that anyone would be willing to pay for it.
Second, there is no benefit to the forecaster. With financial forecasts, you can make some real money. If I predict that China does or doesn’t have an economic bubble and will or won’t go down the toilet, I can place bets on (oops, pardon me, Wall Street isn’t Las Vegas, so I really mean “buy stock in or sell short”) the companies that will be affected by what I predict are the most likely outcomes. This is not the case with traffic. Even if I predict an accident on Interstate 5 at 8:05 AM next Tuesday, and it happens, and people plan accordingly, they’ll save a small amount of time and most likely be inconvenienced even more by adjusting their schedules and deviating from their usual routines. And since the prediction isn’t likely to come true in the first place, once discounted accordingly any traffic prediction is worthless. So who would pay me to make my forecasts?
It all seems pretty pointless. Unless one is stuck in traffic, wondering why no one could predict this jam and why no one is doing anything about it.
Which brings to mind the idea that people are uninterested in traffic forecasts because traffic is at once our own fault and eminently repeatable. We condition ourselves to leave at the same times every day to get to our destinations at the appropriate time, given our daily re-appraisal of prevailing traffic conditions. The only unknowns are traffic accidents, which can’t really be predicted, and unfamiliar trips, for which the prudent allow copious amounts of time.
And that leads to my last point. What if we could predict traffic accidents? Should we do so? Suppose we could say with some confidence that every day from 8am to 8:30am on such and such a stretch of highway the relative risk of an accident is 1,000% higher than usual, perhaps due to a regular surge of traffic at that time or the way the sunlight hits drivers’ eyes, etc. Sure, the absolute risk of an accident would still be microscopically small, but over a year you might see 2 or 3 more accidents at that time and place than elsewhere. So suppose we go out on a limb, and publicly predict a higher relative risk of an accident, and then lo and behold an accident does occur. Will we the predictors be held legally liable for the accident? Will the victims’ families drag us into court and ask the judge “If they knew there was a higher risk of an accident, why didn’t they do something about it?” I’d answer that I did do something about it…I made a prediction and publicized it.
In the end it is probably better to just keep quiet, and tell people traffic is bad because they like to travel about all day long.
I’ve found that I prefer making things to maintaining things. My wife will testify that tidying up is not my forte, but that I don’t mind the most laborious cooking task.
I have an idea. I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R via RCurl and RJSONIO (from http://www.omegahat.org/). Even though my data is split into months physically, I can use RCurl to pull from each of the databases, then use RJSONIO to parse the JSON, then use bootstrap methods to estimate the expected value and confidence bounds and, perhaps more importantly, try to estimate outliers and unusual events.
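The plan above is an R pipeline, but since the monthly databases speak plain HTTP and JSON, the same fetch-and-concatenate loop can be sketched in JavaScript too. The database naming scheme and view path below are invented for illustration; they are not the actual layout of my databases.

```javascript
// Hypothetical sketch: build one URL per monthly database. The
// "loopdata_YYYY_MM" naming and the view path are assumptions.
function monthlyDbUrls(host, year, months) {
  return months.map(function (m) {
    var mm = String(m).padStart(2, "0");
    return host + "/loopdata_" + year + "_" + mm +
           "/_design/stats/_view/raw5min";
  });
}

// Pull each month's view and concatenate the rows (Node 18+ global fetch).
async function fetchAllMonths(urls) {
  var rows = [];
  for (var url of urls) {
    var res = await fetch(url);   // one HTTP GET per monthly database
    var body = await res.json();  // CouchDB returns {total_rows, rows}
    rows = rows.concat(body.rows);
  }
  return rows;
}
```

The concatenated rows would then be the input to the bootstrap step, exactly as in the R version.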
… except for nukes and bocce.
I can *almost* make bootstrapping work, but not entirely within CouchDB. I am going to have to do external processing, which is probably fine.
Closer, but still not quite there with bootstrap sampling in CouchDB. My prior post was mostly thinking out loud. I’ve tried some things since, and this post is an attempt to organize my thoughts on the topic.
Inspired by this post, I am playing around with implementing bootstrap estimates of various statistics as a view in CouchDB. I am not a statistician, so my definition should not be taken as gospel, but bootstrapping is a statistical method in which one repeatedly resamples, with replacement, from an observed set of data in order to estimate some statistic, such as the mean or the median. Most of the older sources I’ve read talk about using it for small to medium sized data sets, and so the k resamples are all of size n, the size of the original data set. But I can’t do that; my input data is too big. So I have to pick a smaller n. I’m going with 1,000 for starters, and I’ll repeat the draw 10,000 times.
(There’s probably a secondary bootstrap I can do there to decide on the optimal size of the bootstrap sample, but I’m not going to dive into that yet.)
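The resampling scheme described above, k draws of size n with replacement, is simple enough to sketch in a few lines of JavaScript. This is a plain in-memory version, not the CouchDB view I am after, and the 95% percentile interval is just one common choice of confidence bound.

```javascript
// Bootstrap the mean: k resamples, each of size n (smaller than the
// full data set), drawn with replacement. Returns the mean of the
// bootstrap means plus 2.5th/97.5th percentile bounds.
function bootstrapMean(data, n, k) {
  var means = [];
  for (var i = 0; i < k; i++) {
    var sum = 0;
    for (var j = 0; j < n; j++) {
      // draw one observation at random, with replacement
      sum += data[Math.floor(Math.random() * data.length)];
    }
    means.push(sum / n);
  }
  means.sort(function (a, b) { return a - b; });
  return {
    mean: means.reduce(function (a, b) { return a + b; }, 0) / k,
    lo: means[Math.floor(0.025 * k)],  // 2.5th percentile
    hi: means[Math.floor(0.975 * k)]   // 97.5th percentile
  };
}
```

With my numbers that would be `bootstrapMean(data, 1000, 10000)`; the smaller-than-n resample size is exactly the compromise discussed above.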
It takes a finite amount of time to process loop data into my database, and the results take up a finite amount of space. So no matter what, if I process and save results, it will take time and space. We’ve ordered a faster, bigger machine, and that will help speed things up and make space less of an issue, but there are more loop detectors to process.
So the presumption is that it is actually *worth* the time and space to compute and store the data. This isn’t necessarily the case. In fact, what I really want access to are the long-term averages of the accident risk values over time. Going forward, I always want to keep around a little bit of data, but the primary use case is to compare historical averages (sliced and diced in various ways) to the current values.
The problem is that it is difficult to maintain historical trends without keeping the data handy. As I’ve said in prior postings and in my notes, I really like how CouchDB’s map/reduce approach allows the generation of different layers of statistics. By emitting an array as the key, and a predicted risk quantity as the value, the reduce function that computes mean and variance will be run for a cascading tree of the keys. So just by writing a map with a key like [loop_id, month, day, 15_minute_period], I can ask for averages over all data, over just a single loop, over a loop for a month, over a loop for a particular Monday in a month, and so on.
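A sketch of what such a view could look like. The document field names (`loop_id`, `ts`, `risk`) are assumptions about the document layout, and the functions are written as named functions with a stub `emit` so they can be run outside CouchDB; in a design document they would be the anonymous `map` and `reduce` members. The reduce carries (count, sum, sum of squares) rather than mean and variance directly, so that the rereduce case is plain addition.

```javascript
// Stub for CouchDB's built-in emit, so the map can run standalone.
var emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Map: key structure [loop_id, month, day, 15_minute_period].
function map(doc) {
  var d = new Date(doc.ts);
  var period = Math.floor((d.getUTCHours() * 60 + d.getUTCMinutes()) / 15);
  emit([doc.loop_id, d.getUTCMonth() + 1, d.getUTCDate(), period], doc.risk);
}

// Reduce: keep (count, sum, sum of squares); partial results then
// combine by simple addition, whether reducing or rereducing.
function reduce(keys, values, rereduce) {
  var acc = { count: 0, sum: 0, sumsq: 0 };
  var vs = rereduce ? values : values.map(function (v) {
    return { count: 1, sum: v, sumsq: v * v };
  });
  vs.forEach(function (v) {
    acc.count += v.count;
    acc.sum += v.sum;
    acc.sumsq += v.sumsq;
  });
  return acc;
}
```

From the reduce output, mean = sum/count and variance = sumsq/count minus mean squared (the population form). Carrying sums rather than finished statistics is what lets the group levels of the key tree all share one reduce.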
On the other hand, this is limiting. If I change my mind and want to aggregate over days without splitting out months, or if I want to put a year field in there to evaluate annual variations, I can’t. I have to rewrite the map, perhaps keeping the same view, and the whole shebang has to be recomputed, which is not trivial when the input set is about 15 GB per week.
As CouchDB matures, perhaps it will do a faster job computing views. The architecture certainly allows the computations to be parallelized, but at the moment I only see a single process thrashing through the calculations.
Finally, if I delete old data, it isn’t clear to me how I would still maintain the running computations of mean and variance. Technically it is possible: all you have to do is combine partial computations, knowing the number of observations that fed into each one. But practically, I have a feeling that when I delete the input data, the output will get blown away.
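The "combine partial computations" step is worth writing out, since it is all that is needed to keep running statistics alive after old rows are dropped. This is the standard pairwise merge for (count, mean, variance) summaries, sketched here with population variances; the `mergeStats` name and summary shape are mine, not anything in CouchDB.

```javascript
// Merge two partial summaries {n, mean, variance} into one.
// Knowing each partial's observation count n is what makes this work.
function mergeStats(a, b) {
  var n = a.n + b.n;
  var delta = b.mean - a.mean;
  // weighted mean, shifted by the gap between the two partial means
  var mean = a.mean + delta * (b.n / n);
  // combine the (population) variances, correcting for that shift
  var variance = (a.n * a.variance + b.n * b.variance +
                  delta * delta * (a.n * b.n) / n) / n;
  return { n: n, mean: mean, variance: variance };
}
```

So as long as each archived summary keeps its count alongside its mean and variance, the raw input can be deleted without losing the ability to roll the summaries up further.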
Perhaps the best approach is to maintain CouchDB for just a day’s worth of data, and run a separate PostgreSQL process to store the map/reduce output. Then as CouchDB matures, I can eventually store longer and longer time periods, but at all times I have a record of past history.
I think a table keyed on the 5-minute-rounded timestamp and loop id, with columns for the mean, variance, and count of each of the different risk predictions, would be good. This would then feed higher-level aggregation tables (day, year, and so on). By keeping the 5-minute mean, variance, and count, I can compute any other mean and variance pretty quickly (the average across all loops, the average for a day, the average for a year of one loop and 5-minute period, etc.).
I can’t seem to get an efficient setup going for storing loop data in CouchDB. On the surface it seems pretty simple: every loop is independent of every other loop, so every observation can be a document. But for this application that is more limiting than I first thought. The problem is that after storing just a few days’ worth of data, the single CouchDB database expands to 35 GB. I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.
Started up a new project recently to estimate traffic flows. Our first task is to extract truck traffic estimates from those flow estimates.