From simple examples to complicated real-world cases

I have a really irritating use-case for a CouchDB view. I have several hundred million documents representing hourly data for 4km grid cells in California, and I need to group them by areas. For example, grid cell i=100, j=223 is in Mendocino County, and in the “NORTH COAST” air basin. Of course I have the geometry of the grid cells and the geometry of the counties, air basins, and so on, in PostgreSQL/PostGIS, and I usually just shoot off a query to get the relationship and I’m done. This is CouchDB, however, and views cannot rely on external information lest they become idemimpotent (I made that up). Everything that the view needs must be in the view from the start.

Fair enough. I set up the SQL queries and generated my 9,800+ row JavaScript hash lookup table that maps each grid cell to the various areas of interest. Now I want to mix that into the view without pulling my hair out.

There is a really simple example in the CouchDB wiki. I’ve reproduced it below:

 {
   _id:"_design/test",
   language: "javascript",
   whatever : {
     stringzone : "exports.string = 'plankton';",
     commonjs : {
       whynot : "exports.test = require('../stringzone')",
       upper : "exports.testing = require('./whynot').test.string.toUpperCase()"
     }
   },
   shows: {
     simple: "function() {return 'ok'};",
     requirey : "function() { var lib = require('whatever/commonjs/upper'); return lib.testing; };"
   },
   views: {
     lib: { 
       foo: "exports.bar = 42;" 
     },
     test: { 
       map: "function(doc) { emit(doc._id, require('views/lib/foo').bar); }"
     }
   }
  }

So where the above example says foo: "exports.bar = 42;", I want to add in my massive hashtable. Obviously cutting and pasting so many lines is not the way to go. Instead, I’m using a couchapp tool.

The concept of a couchapp used to get more press than it currently seems to, but the basic idea is to use code to load up your design doc with attachments and views. In my case, I couldn’t care less about the attachments and the notion of a webapp stored and served by CouchDB. I just want to programmatically construct the view document and push it to CouchDB. I chose to use node.couchapp.js. I could also have "rolled my own", and in fact I probably will this afternoon. I am playing around with grunt, so I used grunt_couchapp (after patching it a bit to use cookie-based authentication).

The basic structure of my directory is the following:

config.json
package.json
Gruntfile.js
app.js
lib
├── cellmembership.json
└── dump_membership.js
node_modules
├── ...
└── ...

The config.json file contains my database details, including my username and password. package.json lists the npm dependencies (mostly whatever the grunt_couchapp tool pulled in), and the node_modules directory holds all the node modules. I do not have an _attachments directory, so I make sure my design doc has no attachments!

Before getting to app.js, in which the design document is defined, I will first talk about what goes into it. The lookup table is stored as a JSON object in lib/cellmembership.json. The contents look like:

{ "100_223":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
 "100_224":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
   ... 9,890 more lines like this ...
 "304_48":{"airbasin":"SALTON SEA","bas":"SS","county":"IMPERIAL","fips":"13","airdistrict":"IMPERIAL COUNTY APCD","dis":"IMP"},
 "98_247":{"airbasin":"NORTH COAST","bas":"NC","county":"HUMBOLDT","fips":"12","airdistrict":"NORTH COAST UNIFIED AQMD","dis":"NCU"}
}

The view code that uses this file is saved to lib/dump_membership.js, and looks like:

module.exports = function(doc){
    var lookup = require('views/lib/cellmembership').lookup
    emit(lookup[doc.cell_id].county, doc.value)
}

These two pieces are put together in app.js, which looks like this:

var couchapp = require('couchapp')
var cellmembership = require('./lib/cellmembership.json')
var mapfun = require('./lib/dump_membership')

var ddoc = {
    _id: '_design/calvad',
    rewrites: [{
      from: '',
      to: 'index.html',
      method: 'GET',
      query: {}
    },{
      from: '/*',
      to: '/*'
    }],
    views: {
        "lib":{
            "cellmembership":"exports.lookup="+JSON.stringify(cellmembership)
        },
        "test":{
            "map":mapfun
        }
    },
    lists: {},
    shows: {}
};


module.exports = ddoc;

So instead of "exports.bar=42;", I put in "exports.lookup="+JSON.stringify(...). The key insight that the simple example didn’t really convey is that your entire "library" module has to be a string. So in this case that means saving my JSON lookup document as a string using JSON.stringify. I probably could have just loaded it directly using fs.readFile(), but I like this way, because it soothes my worries about malformed JSON. If the JSON is screwed up, app.js won’t run, and the failure happens right away, not in the midst of cranking through hundreds of millions of documents.
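For comparison, here is a minimal sketch of the fs-based alternative (not code I actually ran); reading the raw text and splicing it in would skip that validation, unless the file is round-tripped through JSON.parse anyway:

var fs = require('fs')

// read the lookup table as raw text...
var raw = fs.readFileSync('./lib/cellmembership.json', 'utf8')

// ...and round-trip it through JSON.parse/JSON.stringify so that malformed
// JSON still blows up here, not during view generation
var cellmembership = JSON.parse(raw)
var libString = 'exports.lookup=' + JSON.stringify(cellmembership)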

The other bit that I didn’t get from the example was how to include an external function in the design document. What I did was pretty simple, and it worked: I just wrote "map":mapfun. This is exactly the opposite of what needed to be done with the views.lib.cellmembership construct. There the exports.lookup= statement needs to be a string inside the JavaScript, whereas the assignment of the map function needs to be actual JavaScript code, not the string representation of that code.
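My working assumption (I have not dug through node.couchapp.js to confirm the mechanism) is that the tool serializes any function values to their source text when it pushes, so that in the design document CouchDB actually stores, both entries end up as strings, something like:

// what I assume ends up stored in the design document (values elided for brevity)
var stored = {
    views: {
        lib: {
            // was already a string in app.js, so it is stored verbatim:
            cellmembership: "exports.lookup={\"100_223\":{\"airbasin\":\"NORTH COAST\"}, ...}"
        },
        test: {
            // was a real function in app.js; pushed as its source text:
            map: "function (doc){ var lookup = require('views/lib/cellmembership').lookup; emit(lookup[doc.cell_id].county, doc.value) }"
        }
    }
}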

This is exactly the kind of inconsistency that drives me nuts and that nobody ever thinks to document, because only crazies like me run into those edge cases.

Take that, cryptic error message

Sometimes when you have a program that works fine for weeks and weeks, it still has bugs that crop up for no apparent reason. Yesterday I ran into that sort of irritating situation, but I learned some stuff, so I’m writing this up so that there is one more possible solution paired with a cryptic error message for the search engines to suck up.

The situation

I am running a geospatial modeling job to estimate variables in time and space. There are a lot of little grids to process, and each needs a model run for each hour. Continue reading

How I fire off multiple R jobs from node.js

Node.js has become my hammer of choice for most systems programming type jobs. In an earlier post I talked about how to use CouchDB to store the progress and state of jobs that need doing. Here I will demonstrate how I trigger those jobs and update CouchDB using a fairly simple node.js program.

Two key features of node that make this program possible are spawn and the ability to read and manipulate environment variables.

var spawn = require('child_process').spawn
var env = process.env

Node.js is fully capable of using child processes. One can choose from exec, execFile, spawn, and fork. For my usage, the spawn function does exactly what I want—it creates a child process that reports back when it exits.

The other useful tool is the ability to access the current running environment using the process.env variable. This allows my program to take note of any environment variables that are already set, and to fill in any missing variables that my child process might need.
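As a stripped-down sketch of what I mean (the Rscript invocation, the variable name, and the path are placeholders for illustration, not my actual setup):

var spawn = require('child_process').spawn

// make a shallow copy of the current environment so the parent's copy stays untouched
var env = {}
Object.keys(process.env).forEach(function (key) {
    env[key] = process.env[key]
})

// fill in a variable the child might need if it is not already set
if (!env.R_LIBS_USER) env.R_LIBS_USER = '/home/me/R/library'

var job = spawn('Rscript', ['model_run.R'], { env: env })
job.on('exit', function (code) {
    console.log('job finished with exit code ' + code)
})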

Concurrency via queue

Using spawn one can fire off as many jobs as desired. Suppose you have a machine with four cores; then calling spawn four times will efficiently use your processing power. Unfortunately it isn’t usually that simple. Instead, what typically happens is that you have a lot of separable data-crunching tasks that need to be run, and you want to have four data processing jobs running at all times until the work is all done. To accomplish this, the spawn function needs to be called four times (to fill up the processors), and then a new job needs to be spawned whenever one of the existing jobs finishes, as in the sketch below.
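Stripped of error handling and the CouchDB status updates, the queue logic amounts to something like this (a sketch with placeholder names like model_run.R and the task list, not my production code):

var spawn = require('child_process').spawn

var tasks = ['001', '003', '005', '115']   // placeholder: one entry per chunk of work
var limit = 4                              // one job per core
var running = 0

function next() {
    // start jobs until the limit is reached or the work runs out
    while (running < limit && tasks.length > 0) {
        var county = tasks.shift()
        running += 1
        var job = spawn('Rscript', ['model_run.R', county], { env: process.env })
        job.on('exit', function (code) {
            running -= 1
            // a CouchDB status update for the finished job would go here
            next()
        })
    }
}

next()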

Continue reading

Using CouchDB to store state: My hack to manage multi-machine data processing

This article describes how I use CouchDB to manage multiple computing jobs. I make no claims that this is the best way to do things. Rather I want to show how using CouchDB in a small way gradually led to a solution that I could not have come up with using a traditional relational database.

The fundamental problem is that I don’t know what I am doing when it comes to managing a cluster of available computers. As a researcher I often run into big problems that require lots of data crunching. I have access to about 6 computers at any given time, two older, low-powered servers, two better servers, and two workstations, one at work and one at home. If one computer can’t handle a task, it usually means I have to spread the pain around on as many idle CPUs as I can. Of course I’ve heard of cloud computing solutions from Amazon, Joyent, and others, but quite frankly I’ve never had the time and the budget to try out these services for myself.

At the same time, although I can install and manage Gentoo on my machines, I’m not really a sysadmin, and I really can’t wire up a proper distributed heterogeneous computing environment using cool technologies like Ømq. What I’ve always done is human-in-the-loop parallel processing. My problems have some natural parallelism—for example, the data might be split across the 58 counties of California. This means that I can manually run one job per county on each available CPU core.

This human-in-the-loop distributed computing model has its limits, however. Sometimes it is difficult to get every available machine to have the same computational environment. Other times it just gets to be a pain to manually check on all the jobs and keep track of which are done and which still need doing. And when a job crashes halfway through, my manual method sucks pretty hard, as it usually means restarting that job from the beginning.

Continue reading

CouchDB and Erlang

Typical left-field introduction

As far as I understand it, native Erlang views offer no sandboxing of their content, so a view can execute arbitrary commands on the server. For that reason, the ability to run Erlang views natively is likely to be removed in the future.

Problem: big docs crash JSON.parse()

That said, I have a use case for Erlang views. Continue reading

How big is “too big” for documents in CouchDB: Some biased and totally unscientific test results!

I have been storing documents somewhat heuristically in CouchDB. Without doing any rigorous tests, and without keeping track of versions and the associated performance enhancements, I have a general rule that tiny documents are too small, and really big documents are too big.

To illustrate the issues, consider a simple detector that collects data every 30 seconds. One approach is to create one document per observation. Over a day, this will create 2880 documents (except for those pesky daylight savings time days, of course). Over a year, this will create over a million documents. If you have just one detector, then this is probably okay, but if you have thousands or millions of them, this is a lot of individual documents to store, and disk size becomes an issue.
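The arithmetic behind those numbers, for a single detector:

// one document per 30-second observation
var perDay  = (24 * 60 * 60) / 30    // 2,880 documents per day
var perYear = perDay * 365           // 1,051,200 documents per year, per detector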
Continue reading

Iterating view and doc designs multiple times

Just a quick post so that I remember to elaborate on this later.  I have found that whenever I have a large project to do in CouchDB I go through several iterations of designing the documents and the views.

My latest project is typical.

  1. First design was to push in really big documents.  The idea was to run map reduce, copy the reduce output to a second db, and map reduce that for the final result.  But the view generation was too slow, I never got around to designing the second db, and the biggest documents triggered a bug/memory issue.
  2. Continue reading

Watching views build oh so slowly

I have an application that is taxing my PostgreSQL install, and I’ve been taking a whack at using CouchDB to solve it instead.

On the surface, it looks like a pretty good use case, but I’m having trouble getting it to move fast enough.

In a nutshell, I am storing the output of a multiple imputation process. At the moment my production system uses PostgreSQL for this. I store each imputation output, one record per row. I have about 360 million imputations stored this way.

Each imputation represents an estimate of conditions at a mainline freeway detector. That is done in R using the excellent Amelia package. While the imputation is done for all lanes at the site, because I am storing the data in a relational database with a schema, I decided to store one row per lane. Continue reading

Replicator database in practice

The replicator database in CouchDB is cool, but one needs to be mindful when using it.

I like it better than sending a message to CouchDB to replicate dbx from machine y to machine z, because I can be confident that even if I happen to restart CouchDB, that replication is going to finish up.

The problem is that for replications that are not continuous, I end up with a bunch of replication entries in the replicator database. Thousands sometimes. Until I get impatient and just delete the whole thing.

For the way I use it, the best solution is to write a view into the db to pick off all of the replications that are not continuous and that have completed successfully, and then do a bulk delete of those documents. But I’m never organized enough to get that done.
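For what it’s worth, the view I have in mind would be something like this (untested; it leans on the _replication_state field that CouchDB sets to "completed" on non-continuous replication documents when they finish):

// map function for a view in the _replicator database:
// pick out non-continuous replications that completed successfully
function (doc) {
    if (!doc.continuous && doc._replication_state === 'completed') {
        // emit id/rev pairs, which is what a bulk delete needs
        emit(doc._id, doc._rev)
    }
}

Feeding the resulting id/rev pairs back to _bulk_docs with "_deleted": true set on each document would take care of the bulk delete.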

Here’s hoping such a function finds its way into Futon some day.

When R and JSON fight

I have a love-hate relationship with R. R is extremely powerful and lots of fun when it works, but so often I spend hours at a time wondering what is going on (to put my irritation in printable prose).

Today I finally figured out a nagging problem. I am pulling data from CouchDB into R using the excellent RJSONIO and RCurl libraries. JSON has a strict requirement that missing values be represented as null, while R has a more nuanced concept that includes NA as well as NULL. My original use of the RJSONIO library to save data to CouchDB had to account for this fact by using a regular expression to convert NA to proper JSON null values. (I think the latest version of RJSONIO might actually handle this better, but I haven’t checked, as my current code works fine since the regex is conditional.)

Now coming the other way, from CouchDB into R, RJSONIO’s fromJSON() function will happily convert JSON null values into R NULL values. My little couch.get() function looks like this:

couch.get <- function(db,docname, local=TRUE, h=getCurlHandle()){

  if(length(db)>1){
    db <- couch.makedbname(db)
  }
  ## couchdb and localcouchdb are base URL strings defined elsewhere
  uri <- paste(couchdb,db,docname,sep="/");
  if(local) uri <- paste(localcouchdb,db,docname,sep="/");
  ## hack to url encode spaces
  uri <- gsub("\\s","%20",x=uri,perl=TRUE)
  fromJSON(getURL(uri,curl=h)[[1]])

}

The key line is the last one, where the results of RCurl’s getURL() function are passed directly to RJSONIO’s fromJSON() and then returned to the caller.

In my database, to save space, each document is a list of lists for a day.

{
   "_id": "1213686 2007-02-28 10:38:30",
   "_rev": "1-c8f0463d1910cf4e89370ece6ef250e2",
   "data": {
       "nl1": [9,12,12, ... ],
       "nr1": [ ... ],
       ...
       "ts" : [ ... ]
   }
}

Every entry in the ts list has a corresponding entry in every other array in the data object, but that entry could be null. This makes it easy to plot the data against time (using d3, but that is another post) or reload back into R with a timestamp.

But loading data into R isn’t quite the one-liner I was expecting, because of how R handles NULL compared to NA. My first and incorrect attempt was:

alldata <- doc$data
colnames <- names(alldata)
## deal with non ts first
varnames <-  grep( pattern="^ts$",x=colnames,perl=TRUE,invert=TRUE,val=TRUE )
## keep only what I am interested in
varnames <-  grep( pattern="^[no][lr]\\d+$",x=varnames,invert=TRUE,perl=TRUE,val=TRUE )
data.matrix <- matrix(unlist(alldata[varnames]),byrow=FALSE,ncol=length(varnames))

First I grab just the data object, pull off the variables of interest, then make a matrix out of the data.

The problem is the recursive application of unlist buried in the matrix() call. The alldata object is really a list of lists, and some of those lists have NULL values, so the recursive application of unlist SILENTLY wipes out the NULL values (So IRRITATING!).

Instead what you have to do is carefully replace all numeric NULL values with what R wants: NA. (And this is where learning how to do all that callback programming in JavaScript comes in handy, as I define a callback function for the lapply call inline and don’t get worked up about it anymore.)

  ## first, make NULL into NA
  intermediate <- lapply(alldata[varnames],function(l){
    nullmask <- unlist(lapply(l, is.null))
    l[nullmask] <- NA
    l
  })
  ## then do the unlisting
  data.matrix <- matrix(unlist(intermediate),byrow=FALSE,ncol=length(varnames))

Most of the time the simple way worked fine, but it required special handling when I slapped the timeseries column back onto my data. What I ended up having to do (when I was just hacking code that worked (TM)) was to drop timestamps for which the variables I was interested in were all NULL. And yes, the logic was as tortured as the syntax of that sentence.

But every once in a while the data would be out of sync, because sometimes there would be different numbers of NULL values in the variables I was extracting (for example, the mean would be fine, but one of the correlation coefficients would be undefined). In those cases the loop would either work and be wrong (if the odd number of NULL values happened to be perfectly aliased with the length of varnames), or else it would crash and get noted by my error handler.

With the new explicit loop to convert NULL to NA, the loading function works fine, with no more try-errors returned from my try call. And even better, I no longer have to lie awake nights wondering whether some data was just perfectly aliased with missing values so that it slipped through.