Dante was like Tupac

This post is totally wrong, so there. Disclaimer ahoy.

So the lovely wife came home from some nutty adult education class with some interesting but completely irrelevant facts. One of them was that Dante apparently finished the Inferno just days before he died. I think not. I think it more likely that he died, and his krew was trying to get up the scratch for a new stable of horses, so they put together some almost-finished stuff and just *claimed* that Dante finished it. If Dante had died in 1996, for sure he would have been on a giant big screen at this year’s Coachella festival.

From simple examples to complicated real world cases

I have a really irritating use-case for a CouchDB view. I have several hundred million documents representing hourly data for 4km grid cells in California, and I need to group them by areas. For example, grid cell i=100, j=223 is in Mendocino County, and in the “NORTH COAST” air basin. Of course I have the geometry of the grid cells and the geometry of the counties, air basins, and so on, in PostgreSQL/PostGIS, and I usually just shoot off a query to get the relationship and I’m done. This is CouchDB, however, and views cannot rely on external information lest they become idemimpotent (I made that up). Everything that the view needs must be in the view from the start.

Fair enough, I set up the SQL queries and generated my 9,800+ row JavaScript hash lookup table that maps grid cell to various areas of interest. Now I want to mix that into the view without pulling my hair out.
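
(For the curious: generating that lookup file is a one-off node script against PostGIS. The sketch below shows the general shape; the table and column names are made up for illustration, not my actual schema, and it uses the old callback-style pg API.)

// dump_lookup.js -- one-off script to build lib/cellmembership.json.
// Table and column names are illustrative; substitute your own schema.
var fs = require('fs');
var pg = require('pg');

var sql = 'SELECT i, j, airbasin, bas, county, fips, airdistrict, dis ' +
          'FROM grid_cell_membership'; // hypothetical join of cells to areas

pg.connect('postgres://user:pass@localhost/gisdb', function (err, client, done) {
    if (err) { throw err; }
    client.query(sql, function (err, result) {
        done();
        if (err) { throw err; }
        var lookup = {};
        result.rows.forEach(function (row) {
            // key is "i_j", matching the cell_id on the CouchDB documents
            lookup[row.i + '_' + row.j] = {
                airbasin: row.airbasin, bas: row.bas,
                county: row.county, fips: row.fips,
                airdistrict: row.airdistrict, dis: row.dis
            };
        });
        fs.writeFileSync('lib/cellmembership.json', JSON.stringify(lookup));
        pg.end();
    });
});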

There is a really simple example in the CouchDB wiki. I’ve reproduced it below:

 {
   _id:"_design/test",
   language: "javascript",
   whatever : {
     stringzone : "exports.string = 'plankton';",
     commonjs : {
       whynot : "exports.test = require('../stringzone')",
       upper : "exports.testing = require('./whynot').test.string.toUpperCase()"
     }
   },
   shows: {
     simple: "function() {return 'ok'};",
     requirey : "function() { var lib = require('whatever/commonjs/upper'); return lib.testing; };"
   },
   views: {
     lib: { 
       foo: "exports.bar = 42;" 
     },
     test: { 
       map: "function(doc) { emit(doc._id, require('views/lib/foo').bar); }"
     }
   }
  }

So where the above example says foo: "exports.bar = 42;", I want to add in my massive hashtable. Obviously cutting and pasting so many lines is not the way to go. Instead, I’m using a couchapp tool.

The concept of a couchapp used to get more press than it currently seems to, but the basic idea is to use code to load up your design doc with attachments and views. In my case, I couldn’t care less about the attachments and the notion of a webapp stored and served by CouchDB. I just want to programmatically construct the view document, and push it to CouchDB. I chose to use node.couchapp.js. I could also have "rolled my own", and in fact I probably will this afternoon. I am playing around with grunt, so I used grunt_couchapp (after patching it a bit to use cookie based authentication).

The basic structure of my directory is the following:

config.json
package.json
Gruntfile.js
app.js
lib
├── cellmembership.json
└── dump_membership.js
node_modules
├── ...
└── ...

The config.json file contains my database details, including my username and password. package.json lists the npm dependencies, mostly what was pulled in by the grunt_couchapp tool, and the node_modules directory holds all the node modules. I do not have an _attachments directory, so I make sure my design doc has no attachments!
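
(In case you’re wondering what goes in config.json, mine boils down to the CouchDB URL plus credentials. The key names below are illustrative only; check the grunt_couchapp docs for the exact names it expects, and the values are obviously fake.)

{
    "couch": "http://localhost:5984",
    "db": "calvad",
    "user": "admin",
    "pass": "not-my-real-password"
}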

Before getting to app.js, in which the design document is defined, I will first talk about what goes into it. The lookup table is stored as a JSON object in lib/cellmembership.json. The contents look like:

{ "100_223":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
 "100_224":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
   ... 9,890 more lines like this ...
 "304_48":{"airbasin":"SALTON SEA","bas":"SS","county":"IMPERIAL","fips":"13","airdistrict":"IMPERIAL COUNTY APCD","dis":"IMP"},
 "98_247":{"airbasin":"NORTH COAST","bas":"NC","county":"HUMBOLDT","fips":"12","airdistrict":"NORTH COAST UNIFIED AQMD","dis":"NCU"}
}

The view code that uses this file is saved to lib/dump_membership.js, and looks like:

module.exports = function(doc){
    // pull in the lookup table embedded in the design doc
    var lookup = require('views/lib/cellmembership').lookup;
    // guard against documents whose cell isn't in the table
    if (doc.cell_id && lookup[doc.cell_id]) {
        emit(lookup[doc.cell_id].county, doc.value);
    }
};
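
(An aside: since the map emits a numeric doc.value keyed by county, CouchDB’s built-in _sum reduce pairs naturally with it if per-county totals are all you want. It is not part of my actual design doc, but the views entry would look roughly like this:)

// hypothetical: the same map plus CouchDB's built-in _sum reduce,
// so querying with ?group=true returns one total per county
"test": {
    "map": mapfun,      // the function above
    "reduce": "_sum"
}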

These two pieces are put together in app.js, which looks like this:

var couchapp = require('couchapp');
// requiring the JSON file parses it, so malformed JSON fails right here
var cellmembership = require('./lib/cellmembership.json');
// the map function, as real JavaScript (not a string)
var mapfun = require('./lib/dump_membership');

var ddoc = {
    _id: '_design/calvad',
    rewrites: [{
      from: '',
      to: 'index.html',
      method: 'GET',
      query: {}
    },{
      from: '/*',
      to: '/*'
    }],
    views: {
        // a "library" module has to be a string of JavaScript source
        "lib":{
            "cellmembership":"exports.lookup="+JSON.stringify(cellmembership)
        },
        // whereas the map stays a function; couchapp serializes it
        // when the design doc is pushed
        "test":{
            "map":mapfun
        }
    },
    lists: {},
    shows: {}
};

module.exports = ddoc;
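
(To push this, I let grunt do the work, but node.couchapp.js can do it directly too. I am working from memory of its API here, so treat this as a sketch and double-check against the module’s README; the URL is made up:)

// push.js -- push the design doc without grunt (URL is illustrative)
var couchapp = require('couchapp');
var ddoc = require('./app.js');

couchapp.createApp(ddoc, 'http://admin:secret@localhost:5984/calvad', function (app) {
    app.push(function () {
        console.log('design doc pushed');
    });
});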

So instead of "exports.bar=42;", I put in "exports.lookup="+JSON.stringify(...). The key insight that the simple example didn’t really convey is that you want your entire "library" module to be a string. So in this case that means saving my JSON lookup document as a string using JSON.stringify. I probably could have just loaded it directly using fs.readFile(), but I like this way, because it soothes my worries about malformed JSON. If the JSON is screwed up, app.js won’t run, and the failure happens right away, not in the midst of cranking through hundreds of millions of documents.
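
(For contrast, the readFile version would be something like this; nothing in it ever parses the JSON, which is exactly my objection:)

// alternative: splice in the raw file contents as-is; malformed JSON
// sails through here and only blows up later, inside CouchDB
var fs = require('fs');
var raw = fs.readFileSync('./lib/cellmembership.json', 'utf8');
ddoc.views.lib.cellmembership = 'exports.lookup=' + raw;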

The other bit that I didn’t get from the example was how to include an external function in the design document. What I did was pretty simple, and it worked: I just wrote "map":mapfun. This is exactly the opposite of what needed to be done with the views.lib.cellmembership construct. There the exports.lookup= statement needs to be a string inside of the JavaScript, whereas the map needs to be actual JavaScript code, not the string representation of that code.
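
(Concretely, just before the push the two sit in the design doc as different types:)

// the asymmetry in one spot: the library module is already a string,
// while the map is a live function that gets stringified at push time
console.log(typeof ddoc.views.lib.cellmembership); // "string"
console.log(typeof ddoc.views.test.map);           // "function"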

This is exactly the kind of inconsistency that drives me nuts and that nobody ever thinks to document, because only crazies like me run into those edge cases.

Dream big

Robert Longo was a hot artist the year I graduated from college, with
a show called something like “Dream Jumbo: Working the Absolute” that
included an art exhibit at LACMA and a show at UCLA. We bought
tickets and went and it was great. We copied the idea of jumping
people, not painting them quite so large, but capturing the movements
and shadows nonetheless.

A year later I was in Europe, doing the backpack Eurail thing. I had
worked for a year and saved up a little money, enough to buy a used
Minolta. Once I got into the groove of traveling, life pretty much
revolved around looking for Romanesque churches, finding cheap hotels,
and strategically choosing night trains between cities.

I went to Europe with many rolls of film, some negative, some black
and white, but mostly slides. I shot all of it, and eventually had to
buy more. To guard against disaster, I would occasionally spot a deal
at a shop and develop a batch of exposed rolls.

My past self is envious of my current self: digital cameras mean no
more bag full of film canisters. Back then I shot and shared my
images with close friends and family; now I can shoot and post to the
internet to theoretically share with everybody. I can “develop”
pictures on my laptop, and even shoot movies with my camera.

My current self is envious of my past self, with no responsibilities
except to myself, able to go wherever and do whatever. I took
pictures, went to museums, and looked at old architecture. I played
harmonica in between cars on night trains. I watched my bank account
drain down, and got a cash advance on my credit card.

I haven’t heard anything about Robert Longo in years. He may still be
doing stuff, but I don’t care, and he’s certainly not as hot as he
once was. I take a lot more photographs now, but I don’t draw nearly
as much and I haven’t aspired to be an artist in years.

Take that, cryptic error message

Sometimes a program that has worked fine for weeks and weeks still has bugs that crop up for no apparent reason. Yesterday I ran into that sort of irritating situation, but I learned some stuff, so I’m writing it up so that there is one more possible solution paired with a cryptic error message for the search engines to suck up.

The situation

I am running a geospatial modeling job to estimate variables in time and space. There are a lot of little grids to process, and each needs a model run for each hour. Continue reading

How I use ffmpeg in Linux to record from MacBook Pro iSight

I have an older MacBook Pro (version 5,5) and I recently got screencasting working again after about a year in which nothing worked. There are two steps to getting usable video output. First I needed to get audio recording working properly, then I needed to get the video to grab without dropping all the frames. Once I got it working I wrote it into a tiny little script that I’ve pasted below.

Continue reading

Using CouchDB to store state: My hack to manage multi-machine data processing

This article describes how I use CouchDB to manage multiple computing jobs. I make no claims that this is the best way to do things. Rather I want to show how using CouchDB in a small way gradually led to a solution that I could not have come up with using a traditional relational database.

The fundamental problem is that I don’t know what I am doing when it comes to managing a cluster of available computers. As a researcher I often run into big problems that require lots of data crunching. I have access to about 6 computers at any given time: two older, low-powered servers, two better servers, and two workstations, one at work and one at home. If one computer can’t handle a task, it usually means I have to spread the pain around on as many idle CPUs as I can. Of course I’ve heard of cloud computing solutions from Amazon, Joyent, and others, but quite frankly I’ve never had the time or the budget to try out these services for myself.

At the same time, although I can install and manage Gentoo on my machines, I’m not really a sysadmin, and I can’t wire up a proper distributed heterogeneous computing environment using cool technologies like Ømq. What I’ve always done is human-in-the-loop parallel processing. My problems have some natural parallelism—for example, the data might be split across the 58 counties of California. This means that I can manually run one job per county on each available CPU core.

This human-in-the-loop distributed computer model has its limits however. Sometimes it is difficult to get every available machine to have the same computational environment. Other times it just gets to be a pain to have to manually check on all the jobs and keep track of which are done and which still need doing. And when a job crashes halfway through, then my manual method sucks pretty hard, as it usually means restarting that job from the beginning.

Continue reading

Inspiration, redirection, and a broken arm

Craig and I have officially started our company, Activimetrics LLC. Our goal is to use the company as a platform to promote activity-based modeling approaches, but our target market is not as narrow as we first thought. My thinking about what we can do with our skills and experience has been broadened as a direct result of responding to the Markets for Good Data Interoperability Challenge. Continue reading

Public Planning Models

Craig and I just posted our entry into the Knight Newschallenge Lottery. It is called Public Planning Models, in a classic case of a working title ending up being the final title.

The basic idea is that planning models are opaque and mysterious, and really buggy and error prone. The problem isn’t the fault of the modelers or the model systems, but rather the lack of input data. Consider that a planning model first tries to model today’s world, and then tries to model the future using that same model with extrapolated conditions. There are two sources of error—the model of the present, and the extrapolation of that model into the future.

In a perfect, totalitarian state, the government would know everywhere you go, and all that information could be loaded into the model of the present. Calibration would be simple, because every vehicle is already in the model, so of course it captures reality. But even in a totalitarian, all-knowing state, predicting the future isn’t possible. Trends reverse themselves, people pick up different habits, and technology happens, changing the way we do things.

We have been watching and participating in the evolution of planning models, in particular pushing for the adoption of activity-based models over trip-based models. The big problem here is the burden of data collection, as well as the increased complexity of the model framework. Activity-based models are being adopted only incrementally because they are complicated and cost a lot of money to deploy.

Public Planning Models takes a different approach. Rather than trying to come up with better data collection processes and better modeling techniques, we thought it would be better to try to expose the full ugliness of current planning models to the public. This serves three purposes. First, people can see just how weak many of the fundamental assumptions in these models are. Second, everybody can take a look at the model system and suggest corrections and improvements, in the spirit of crowd-sourcing the model calibration step. And third, exposing the models and the applications of those models will give people an incentive to become more involved. That involvement can run the gamut from simply providing a few days’ worth of travel and activity data to the model’s input data set, to taking the model system itself and playing around with alternate planning scenarios.

Anyway, take a look at our proposal, add comments, and if you know one of the judges, put in a good word for our efforts. There are tons of submissions, and all of the ones I’ve read so far look pretty good.