Using CouchDB to store state: My hack to manage multi-machine data processing

This article describes how I use CouchDB to manage multiple computing jobs. I make no claims that this is the best way to do things. Rather, I want to show how using CouchDB in a small way gradually led to a solution that I could not have come up with using a traditional relational database.

The fundamental problem is that I don’t know what I am doing when it comes to managing a cluster of available computers. As a researcher I often run into big problems that require lots of data crunching. I have access to about six computers at any given time: two older, low-powered servers, two better servers, and two workstations, one at work and one at home. If one computer can’t handle a task, it usually means I have to spread the pain around on as many idle CPUs as I can. Of course I’ve heard of the cloud computing offerings from Amazon, Joyent, and others, but quite frankly I’ve never had the time or the budget to try those services for myself.

At the same time, although I can install and manage Gentoo on my machines, I’m not really a sysadmin, and I really can’t wire up a proper distributed, heterogeneous computing environment using cool technologies like ØMQ. What I’ve always done instead is human-in-the-loop parallel processing. My problems have some natural parallelism; for example, the data might be split across the 58 counties of California. That means I can manually run one job per county on each available CPU core.

This human-in-the-loop distributed computing model has its limits, however. Sometimes it is difficult to get every available machine running the same computational environment. Other times it is just a pain to manually check on all the jobs and keep track of which are done and which still need doing. And when a job crashes halfway through, my manual method sucks pretty hard, as it usually means restarting that job from the beginning.
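
Skipping ahead a bit, the shape of the hack is easy to show. Here is a minimal sketch, not my actual code: one CouchDB document per county job, with a status field that a worker flips to claim the job. The database name, URL, and field names here are made up for illustration; the part that matters is that CouchDB’s revision checking does the locking for free.

```js
// One CouchDB document per county job, claimed by flipping "status".
// Assumes CouchDB at localhost:5984, a database named "jobs", and a
// recent Node (18+) with a global fetch().
const os = require('os');
const DB = 'http://localhost:5984/jobs';

// Try to claim a pending job. If another worker updated the document
// first, the PUT comes back 409 (conflict) because our _rev is stale,
// and we simply move on to the next county.
async function claimJob(county) {
  const res = await fetch(`${DB}/${county}`);
  if (!res.ok) return null;
  const doc = await res.json();
  if (doc.status !== 'pending') return null;

  doc.status = 'running';
  doc.worker = os.hostname();
  const put = await fetch(`${DB}/${county}`, {
    method: 'PUT',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(doc), // includes _rev, so stale claims 409
  });
  return put.ok ? doc : null;
}
```

Each machine just loops over the county list calling something like claimJob, so there is no central scheduler to set up, and a crashed job is visible to everyone as a document stuck in the running state.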

Continue reading

Inspiration, redirection, and a broken arm

Craig and I have officially started our company, Activimetrics LLC. Our goal is to use the company as a platform to promote activity-based modeling approaches, but our target market is not as narrow as we first thought. My thinking about what we can do with our skills and experience has been broadened as a direct result of responding to the Markets for Good Data Interoperability Challenge. Continue reading

Public Planning Models

Craig and I just posted our entry into the Knight Newschallenge Lottery. It is called Public Planning Models, in a classic case of a working title ending up being the final title.

The basic idea is that planning models are opaque and mysterious, and really buggy and error-prone. The problem isn’t the fault of the modelers or the model systems, but rather the lack of input data. Consider that a planning model first tries to model today’s world, and then tries to model the future using that same model with extrapolated conditions. There are two sources of error: the model of the present, and the extrapolation of that model into the future.

In a perfect, totalitarian state, the government would know everywhere you go, and all that information could be loaded into the model of the present. Calibration would be simple, because every vehicle is already in the model, so of course it captures reality. But even in a totalitarian, all-knowing state, predicting the future isn’t possible. Trends reverse themselves, people pick up different habits, and technology happens, changing the way we do things.

We have been watching and participating in the evolution of planning models, in particular pushing for the adoption of activity-based models over trip-based models. The big problem here is the burden of data collection, as well as the increased complexity of the model framework. Activity-based models are being adopted only incrementally because they are so complicated and cost so much money to deploy.

Public Planning Models takes a different approach. Rather than trying to come up with better data collection processes and better modeling techniques, we thought it would be better to try to expose the full ugliness of current planning models to the public. This serves three purposes. First, people can see just how weak many of the fundamental assumptions in these models are. Second, everybody can take a look at the model system and suggest corrections and improvements, in the spirit of crowd-sourcing the model calibration step. And third, exposing the models and the applications of those models will give people an incentive to become more involved. That involvement can run the gamut from simply providing a few days’ worth of travel and activity data to the model’s input data set, to taking the model system itself and playing around with alternate planning scenarios.

Anyway, take a look at our proposal, add comments, and if you know one of the judges, put in a good word for our efforts. There are tons of submissions, and all of the ones I’ve read so far look pretty good.

CouchDB and Erlang

Typical left-field introduction

As far as I understand it, the ability to run Erlang views natively is likely to be removed in the future: an Erlang view is not sandboxed, so it can execute arbitrary commands on the server. In other words, Erlang views are likely to go away.
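
For the same reason, native views aren’t even on by default. If I remember right, on a 1.x-era CouchDB you turn them on with a stanza in local.ini (treat the exact lines below as an assumption about your version; check your server’s docs):

```ini
[native_query_servers]
erlang = {couch_native_process, start_link, []}
```

After that, a design document can declare "language": "erlang" and skip the JSON round-trip through the JavaScript query server entirely.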

Problem: big docs crash JSON.parse()

That said, I have a use case for Erlang views. Continue reading

Mode choice versus life cycle change

During TRB I attended a presentation on the effect of life cycle changes on travel pattern characteristics. The presenter defined the usual life cycle changes (getting married, changing home location, having a child, etc.) and set up a structural equation model to relate these changes to the size of a person’s social network, the length (distance) and number of trips per day, the length (duration) and number of activities per day, and so on.

The work was interesting and got me thinking about whether one could treat “being green” as a life cycle choice rather than as a mode choice. In the usual mode choice context, Continue reading

Using superagent to authenticate a user-agent in node.js, plus a bonus bug!

Summary

This post describes how I use the superagent library to test access to restricted resources on my web server. It took a bit more effort than I expected, so I thought I’d write it up for the greater good.

Context

I am running a website on which some resources are open to the internet, while others require authentication against our CAS server.

I have been logging into the CAS server using request. But in the interest of trying out different libraries and all that, I decided to rewrite my method using superagent.
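
The gist of it looks roughly like this. This is a minimal sketch, not the exact code from this post: the hidden “lt” (login ticket) field and the _eventId parameter are what a typical CAS 2.x login form uses, so check your own server’s form for the real field names, and it assumes a superagent version with promise support.

```js
// superagent's agent() persists cookies across requests, which is
// exactly what a CAS session needs.
const request = require('superagent');
const agent = request.agent();

async function casLogin(casUrl, username, password) {
  // 1. GET the login form to pick up the session cookie and the
  //    one-time login ticket ("lt") hidden in the form.
  const page = await agent.get(casUrl + '/login');
  const match = /name="lt" value="([^"]+)"/.exec(page.text);
  if (!match) throw new Error('no login ticket on the login page');

  // 2. POST the credentials back. On success CAS redirects with a
  //    service ticket, and the agent follows it, keeping the cookie.
  await agent.post(casUrl + '/login')
    .type('form')
    .send({ username, password, lt: match[1], _eventId: 'submit' });

  // Requests made through this agent are now authenticated, while
  // requests through plain `request` are not, which is exactly the
  // pair of cases the tests need to cover.
  return agent;
}
```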

I need to log into the CAS server from node.js because I am writing tests that verify that the protected resources are hidden from unauthenticated users, and available to authenticated ones. Continue reading