Using CouchDB to store state: My hack to manage multi-machine data processing

This article describes how I use CouchDB to manage multiple computing jobs. I make no claims that this is the best way to do things. Rather, I want to show how using CouchDB in a small way gradually led to a solution that I could not have come up with using a traditional relational database.

The fundamental problem is that I don’t know what I am doing when it comes to managing a cluster of available computers. As a researcher I often run into big problems that require lots of data crunching. I have access to about six computers at any given time: two older, low-powered servers, two better servers, and two workstations, one at work and one at home. If one computer can’t handle a task, it usually means I have to spread the pain around on as many idle CPUs as I can. Of course I’ve heard of the cloud computing offerings from Amazon, Joyent, and others, but quite frankly I’ve never had the time or the budget to try these services for myself.

At the same time, although I can install and manage Gentoo on my machines, I’m not really a sysadmin, and I can’t wire up a proper distributed heterogeneous computing environment using cool technologies like ØMQ. What I’ve always done is human-in-the-loop parallel processing. My problems have some natural parallelism—for example, the data might be split across the 58 counties of California. This means that I can manually run one job per county on each available CPU core.

This human-in-the-loop distributed computing model has its limits, however. Sometimes it is difficult to get every available machine to have the same computational environment. Other times it just gets to be a pain to manually check on all the jobs and keep track of which are done and which still need doing. And when a job crashes halfway through, my manual method sucks pretty hard, as it usually means restarting that job from the beginning.
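To make that concrete, here is a minimal sketch of the kind of job-state tracking CouchDB makes easy. This is illustrative only, not the code from this project: it assumes a CouchDB server at http://localhost:5984 and a hypothetical database named jobs, with one document per county job.

```r
# Illustrative sketch, not this project's actual code: one CouchDB
# document per county job, assuming a server at http://localhost:5984
# and a hypothetical database named "jobs" that already exists.
library(httr)

couch <- "http://localhost:5984/jobs"

# Try to claim a job by creating its document. CouchDB refuses a second
# PUT to the same id without a _rev (HTTP 409 conflict), so a conflict
# means another machine has already claimed this county.
claim_job <- function(county) {
  res <- PUT(paste0(couch, "/", county),
             body = list(status  = "running",
                         host    = Sys.info()[["nodename"]],
                         started = format(Sys.time())),
             encode = "json")
  status_code(res) == 201  # TRUE only if we got the job
}

# Look up a job's recorded state (NA if it has never been started)
job_status <- function(county) {
  res <- GET(paste0(couch, "/", county))
  if (status_code(res) == 404) return(NA)
  content(res)$status
}
```

Since CouchDB speaks plain HTTP, the same trick works just as well from Perl or a shell script as it does from R, which matters when the jobs themselves are written in different languages.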


Public Planning Models

Craig and I just posted our entry into the Knight Newschallenge Lottery. It is called Public Planning Models, in a classic case of a working title ending up being the final title.

The basic idea is that planning models are opaque and mysterious, and really buggy and error prone. The problem isn’t the fault of the modelers or the model systems, but rather the lack of input data. Consider that a planning model first tries to model today’s world, and then tries to model the future using that same model with extrapolated conditions. There are two sources of error—the model of the present, and the extrapolation of that model into the future.

In a perfect, totalitarian state, the government would know everywhere you go, and all that information could be loaded into the model of the present. Calibration would be simple, because every vehicle is already in the model, so of course it captures reality. But even in a totalitarian, all-knowing state, predicting the future isn’t possible. Trends reverse themselves, people pick up different habits, and technology happens, changing the way we do things.

We have been watching and participating in the evolution of planning models, in particular pushing for the adoption of activity-based models over trip-based models. The big problem here is the burden of data collection, as well as the increased complexity of the model framework. Activity-based models are being adopted only incrementally because they are so complicated and cost so much money to deploy.

Public Planning Models takes a different approach. Rather than trying to come up with better data collection processes and better modeling techniques, we thought it would be better to expose the full ugliness of current planning models to the public. This serves three purposes. First, people can see just how weak many of the fundamental assumptions in these models are. Second, everybody can take a look at the model system and suggest corrections and improvements, in the spirit of crowd-sourcing the model calibration step. And third, exposing the models and the applications of those models will give people an incentive to become more involved. That involvement can run the gamut from simply providing a few days’ worth of travel and activity data for the model’s input data set, to taking the model system itself and playing around with alternate planning scenarios.

Anyway, take a look at our proposal, add comments, and if you know one of the judges, put in a good word for our efforts. There are tons of submissions, and all of the ones I’ve read so far look pretty good.

Mode choice versus life cycle change

During TRB I attended a presentation on the effect of life cycle changes on travel pattern characteristics. The presenter defined the usual life cycle changes (getting married, changing home location, having a child, etc.) and set up a structural equations model to relate these changes to the size of a person’s social network, the length (distance) and number of trips per day, the length (duration) and number of activities per day, and so on.
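For readers who have not run into one, the shape of such a model can be sketched in a few lines of R with the lavaan package. The variables and structure below are invented purely for illustration; they are not the presenter’s actual specification.

```r
# Hypothetical specification, invented only to show the shape of this
# kind of structural equations model; it is not the presenter's model.
library(lavaan)

model <- '
  # a latent "life cycle change" factor behind the observed events
  lifecycle     =~ got_married + moved_home + had_child
  # life cycle changes shift network size and daily travel
  network_size  ~ lifecycle
  trips_per_day ~ lifecycle + network_size
'

fit <- sem(model, data = travel_survey)  # travel_survey: hypothetical data
summary(fit, standardized = TRUE)
```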

The work was interesting and got me thinking about whether one could treat “being green” as a life cycle choice rather than as a mode choice. In the usual mode choice context, …

Reduced parking requirements article

There is an article in today’s LA Times that talks about a move to reduce the parking requirements for various kinds of retail. This is very interesting and could begin to push people to reduce driving. In parallel, there are a few laws on the books in California that require denser development in order to reduce greenhouse gas emissions. Now denser development by itself will not reduce greenhouse gas emissions, and may in fact make things worse if everybody keeps driving exactly as they do now (imagine…more destinations crammed into a smaller space means more cars on the same streets means more traffic means more emissions). But if denser development is paired with reduced parking requirements, there is even more incentive to leave the car at home for a trip or two (as there will be nowhere to park it when you get there).

Development server logs during development

In a prior post trumpeting my modest success with getting GeoJSON tiles to work, I typed in my server address but didn’t make it a link. That way robots wouldn’t automatically follow the link and my development server wouldn’t get indexed by Google.

What is interesting to me is that I still get the occasional hit from that posting. And this is with the server bouncing up and down almost continuously as I add functionality. Just now I was refactoring the tile caching service I wrote, and in between server restarts, someone hit my demo app.

And the GeoJSON tiler is coming along. In making the caching part more robust, I added a recursive directory creation hack, which I explain in the full post.
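The hack itself is in the full post, but the general idea, sketched here in R rather than in the tiler’s own code, is that a tile cache is laid out as nested zoom/column/row directories, and a cache miss may need the whole chain of missing parents created at once.

```r
# Generic sketch of the idea, not the tiler's actual code. Tile caches
# are commonly laid out as nested z/x/y paths, so a cache miss may need
# several levels of directory created in one go.
cache_tile <- function(root, z, x, y, geojson) {
  dir <- file.path(root, z, x)
  # recursive = TRUE is R's `mkdir -p`: it creates every missing
  # parent in one call instead of failing at the first gap
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  writeLines(geojson, file.path(dir, paste0(y, ".json")))
}

# e.g. cache_tile("cache", 12, 655, 1583, tile_json)
```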


California Traffic Management Labs

We are searching for a new name for our physical and intellectual resources here at UCI. We have a real-world laboratory in that we have streets and highways that are instrumented. We used to call ourselves the “ATMS Testbed”, and we still call ourselves the Testbed, but we’re trying to push the notion that we aren’t ATMS. ATMS stands for Advanced Transportation Management System, but the term has been usurped by its association with the software that is used to run modern traffic control centers. So ATMS sounds like we just work on the ATMS software, when we actually do almost nothing with it!

So we kicked around some names over email, had a meeting this morning to discuss them, and settled on California Traffic Management Labs in less than an hour! Magically, CTMLabs.org, .com, and .net were all available, so we got them. We decided to use http://www.ctmlabs.net as the primary site because, hey, “net” is like network, which is what we do.

So, once we get our website up and running at the end of the summer, if you want to do traffic management research and deployment, come to http://CTMLabs.net and see what we have to offer.

R. Struggle with it and it becomes clear.

Been using R almost exclusively for the past few weeks. I’ve always liked R, but I find the syntax and style maddeningly slow to ingest. Perhaps everybody is like this, but I’ve found that some programming language idioms I take to pretty readily (JavaScript and Perl), some I hate (Java before generics and Spring IoC was odious; after, it is at least tolerable), and others I just have to fight through a few weeks of doing things utterly wrong.

R falls in that last camp, but since I used to be pretty good at it back when I was working on my dissertation, I’ve always considered it my go-to stats language. So now that I have a major deliverable due, and it really needs more advanced statistics than the usual “mean/max/min/sd” one can usually throw at data, I’ve taken the plunge back into R syntax.

I’m building up scripts to process massive amounts of data (massive to me, perhaps not to Google and Yahoo, but a terabyte is still a terabyte), so each step of these scripts has to be fast. Periodically I come across some step that is just too slow, or something that used to be fast but bogs down as I add more cruft and throw more data at it.
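A classic example of that kind of creeping slowdown, shown here as a generic illustration rather than the specific step I hit: growing a vector inside a loop forces R to keep reallocating it, so the cost climbs steeply with the data size, while preallocating, or dropping the loop entirely, stays fast.

```r
# Illustration of a creeping slowdown, not the actual offending step:
# growing a vector element by element forces repeated reallocation.
n <- 1e5

slow <- c()                                  # starts empty, grows every pass
system.time(for (i in 1:n) slow[i] <- i^2)

fast <- numeric(n)                           # preallocated once
system.time(for (i in 1:n) fast[i] <- i^2)

fastest <- (1:n)^2                           # vectorized: no loop at all
```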

The full post has an example of how R continues to confound me even after 3 weeks of R R R (I’m a pirate, watch me R).

Musing about traffic forecasts

I wonder if there is any point to making traffic forecasts. Everybody likes weather forecasts and economic forecasts, and even global warming forecasts and peak oil forecasts. But I don’t see any traffic forecasts being made, and I’ve been thinking about why.

First off, I can’t see any direct benefit to making traffic forecasts. In the end, the information isn’t all that informative. The signal, the interesting and novel bit of information, must be something you didn’t already know; otherwise it isn’t informative. Traffic is always the same, save for the occasional incident, and the average driver sees and measures it every day. A prediction of traffic therefore contains very little information for its consumer, and so it isn’t likely that anyone will be willing to pay for it.

Second, there is no benefit to the forecaster. With financial forecasts, you can make some real money. If I predict China does/doesn’t have an economic bubble and will/won’t go down the toilet, I can place bets on (oops, pardon me, Wall Street isn’t Las Vegas, so I really mean “buy stock in or sell short”) the companies that will be affected by what I predict are the most likely outcomes. This is not the case with traffic. Even if I predict an accident on Interstate 5 at 8:05 AM next Tuesday, and it happens, and people plan accordingly, they’ll save a small amount of time and most likely be inconvenienced even more by adjusting their schedules and deviating from their usual routine. And since the prediction isn’t likely to come true in the first place, when discounted accordingly any traffic prediction is nearly worthless. So who would pay me to make my forecasts?

It all seems pretty pointless. Unless one is stuck in traffic, wondering why no one could predict this jam and why no one is doing anything about it.

Which brings to mind the idea that people are uninterested in traffic forecasts because traffic is at once our own fault and eminently repeatable. We condition ourselves to leave at the same times every day to get to our destinations at the appropriate time, given our daily re-appraisal of prevailing traffic conditions. The only unknowns are traffic accidents, which can’t really be predicted, and unknown trips, for which the prudent allow copious amounts of time.

And that leads to my last point. What if we could predict traffic accidents? Should we do so? Suppose we could say with some confidence that every day from 8:00 to 8:30 AM on such and such a stretch of highway the relative risk of an accident is 1,000% higher than usual, perhaps due to a regular surge of traffic at that time or the way the sunlight hits drivers’ eyes, etc. Sure, the absolute risk of an accident would still be microscopically small, but over a year you might see 2 or 3 more accidents at that time and place than elsewhere. So suppose we go out on a limb and publicly predict a higher relative risk of an accident, and then lo and behold an accident does occur. Will we, the predictors, be held legally liable for the accident? Will the victims’ families drag us into court and ask the judge, “If they knew there was a higher risk of an accident, why didn’t they do something about it?” I’d answer that I did do something about it…I made a prediction and publicized it.
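To put rough numbers on that claim (the baseline probability here is invented purely for illustration, and “1,000% higher” is read as roughly a tenfold risk):

```r
# Back-of-the-envelope check of the claim above; the baseline is a
# made-up number, used only to show the arithmetic.
baseline <- 1 / 1000        # assumed chance of a crash there in that half hour
elevated <- 10 * baseline   # "1,000% higher than usual", read as tenfold
(elevated - baseline) * 365 # about 3.3 extra crashes a year at that time and place
```

A relative risk that sounds enormous still amounts to only a couple of extra accidents a year at that one spot.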

In the end it is probably better to just keep quiet, and tell people traffic is bad because they like to travel about all day long.
