Watching views build oh so slowly

I have an application that is taxing my PostgreSQL install, and I’ve been taking a whack at using CouchDB to solve it instead.

On the surface, it looks like a pretty good use case, but I’m having trouble getting it to move fast enough.

In a nutshell, I am storing the output of a multiple imputation process. At the moment my production system uses PostgreSQL for this. I store the imputation output one record per row, and I have about 360 million imputations stored this way.

Each imputation represents an estimate of conditions at a mainline freeway detector. The imputation is done in R using the excellent Amelia package. While the imputation is run for all lanes at a site at once, because I am storing the data in a relational database with a schema, I decided to store one row per lane. Continue reading

Replicator database in practice

The replicator database in CouchDB is cool, but one needs to be mindful when using it.

I like it better than sending a message to CouchDB telling it to replicate database x from machine y to machine z, because I can be confident that even if I happen to restart CouchDB, that replication is going to finish up.

The problem is that for replications that are not continuous, I end up with a bunch of replication entries in the replicator database. Thousands sometimes. Until I get impatient and just delete the whole thing.

For the way I use it, the best solution is to write a view into the db to pick off all of the replications that are not continuous and that have completed successfully, and then do a bulk delete of those documents. But I’m never organized enough to get that done.
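If I ever do get organized, the cleanup could be a short R script along these lines. This is only a sketch: to keep it brief it filters on the client with _all_docs rather than a saved view, and the server URL is a placeholder.

library(RCurl)
library(RJSONIO)

## pull every doc out of the _replicator db (filtering client-side here,
## instead of the saved view I keep meaning to write)
couch <- "http://127.0.0.1:5984"
res <- fromJSON(getURL(paste(couch, "_replicator/_all_docs?include_docs=true", sep = "/")))
## keep only the one-shot replications that have finished
done <- Filter(function(row){
  doc <- row$doc
  is.null(doc$continuous) && identical(doc[["_replication_state"]], "completed")
}, res$rows)
## mark them deleted and remove them in a single _bulk_docs request
deletions <- lapply(done, function(row){
  list("_id" = row$doc[["_id"]], "_rev" = row$doc[["_rev"]], "_deleted" = TRUE)
})
getURL(paste(couch, "_replicator/_bulk_docs", sep = "/"),
       customrequest = "POST",
       httpheader = c("Content-Type" = "application/json"),
       postfields = toJSON(list(docs = deletions)))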

Here’s hoping such a function finds its way into Futon some day.

When R and JSON fight

I have a love-hate relationship with R. R is extremely powerful and lots of fun when it works, but so often I spend hours at a time wondering what is going on (to put my irritation in printable prose).

Today I finally figured out a nagging problem. I am pulling data from CouchDB into R using the excellent RJSONIO and RCurl libraries. JSON has a strict requirement that unknown values are written as null, while R has a more nuanced concept that includes NA as well as NULL. My original code for saving data to CouchDB with RJSONIO had to account for this by using a regular expression to convert NA values to proper JSON null. (I think the latest version of RJSONIO may handle this better, but I haven’t checked, since my current code works fine and the regex is conditional.)
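The fix on the way out was nothing fancy. Roughly (this is a simplified sketch rather than my production code, and the field name is made up):

library(RJSONIO)

## toJSON() used to write R's NA out literally, which is not legal JSON,
## so patch it up with a regex before shipping the doc to CouchDB
doc <- list(nl1 = c(9, 12, NA, 12))
json <- toJSON(doc)
json <- gsub("\\bNA\\b", "null", json, perl = TRUE)
## careful: a string value containing the word NA would also get clobbered,
## which is why my real code only applies the regex conditionally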

Now coming the other way, from CouchDB into R, RJSONIO’s fromJSON() function will happily convert JSON null values into R NULL values. My little couch.get() function looks like this:

couch.get <- function(db,docname, local=TRUE, h=getCurlHandle()){
  if(length(db)>1){
    db <- couch.makedbname(db)
  }
  uri <- paste(couchdb,db,docname,sep="/")
  if(local) uri <- paste(localcouchdb,db,docname,sep="/")
  ## hack to url encode spaces
  uri <- gsub("\\s","%20",x=uri,perl=TRUE)
  fromJSON(getURL(uri,curl=h)[[1]])
}

The key line is the last one, where the results of RCurl’s getURL() function are passed directly to RJSONIO’s fromJSON() and then returned to the caller.
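A call looks something like this (the database components and document id here are just for illustration, and couchdb, localcouchdb, and couch.makedbname() are assumed to be defined elsewhere in my code):

## spaces in the document id get %20-encoded inside couch.get
doc <- couch.get(c("vdsdata", "d12", "2007"), "1213686 2007-02-28 10:38:30", local = FALSE)
names(doc$data)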

In my database, to save space, each document is a list of lists for a day.

{
   "_id": "1213686 2007-02-28 10:38:30",
   "_rev": "1-c8f0463d1910cf4e89370ece6ef250e2",
   "data": {
       "nl1": [9,12,12, ... ],
       "nr1": [ ... ],
       ...
       "ts" : [ ... ]
   }
}

Every entry in the ts list has a corresponding entry in every other array in the data object, but that entry could be null. This makes it easy to plot the data against time (using d3, but that is another post) or reload back into R with a timestamp.

But loading data into R isn’t quite the one-liner I was expecting, because of how R handles NULL compared to NA. My first and incorrect attempt was:

alldata <- doc$data
colnames <- names(alldata)
## deal with non ts first
varnames <-  grep( pattern="^ts$",x=colnames,perl=TRUE,invert=TRUE,val=TRUE )
## keep only what I am interested in
varnames <-  grep( pattern="^[no][lr]\\d+$",x=varnames,invert=TRUE,perl=TRUE,val=TRUE )
data.matrix <- matrix(unlist(alldata[varnames]),byrow=FALSE,ncol=length(varnames))

First I grab just the data object, pull off the variables of interest, then make a matrix out of the data.

The problem is the recursive application of unlist buried in the matrix() call. The alldata object is really a list of lists, and some of those lists have NULL values, so the recursive application of unlist SILENTLY wipes out the NULL values (so IRRITATING!).

Instead what you have to do is carefully replace all numeric NULL values with what R wants: NA. (And this is where learning how to do all that callback programming in JavaScript comes in handy, as I define a callback function for the lapply call inline and don’t get worked up about it anymore.)

  ## first, make NULL into NA
  intermediate <- lapply(alldata[varnames],function(l){
    nullmask <- unlist(lapply(l, is.null))
    l[nullmask] <- NA
    l
  })
  ## then do the unlisting
  data.matrix <- matrix(unlist(intermediate),byrow=FALSE,ncol=length(varnames))

Most of the time the simple way worked fine, but it required special handling when I slapped the timeseries column back onto my data. What I ended up having to do (when I was just hacking code that worked (TM)) was to drop any timestamp for which every variable I was interested in was NULL. And yes, the logic was as tortured as the syntax of that sentence.

But every once in a while the data would be out of sync, because sometimes there would be different numbers of NULL values in the variables I was extracting (for example, the mean would be fine, but one of the correlation coefficients would be undefined). In those cases the loop would either work and be wrong (if the number of NULL values happened to be perfectly aliased with the length of varnames), or else it would crash and get noted by my error handler.

With the new explicit loop to convert NULL to NA, the loading function works fine, with no more try-errors returned from my try call. And even better, I no longer have to lie awake at night wondering whether some data slipped through because its missing values were perfectly aliased.

Slacking on the Couch

I run Slackware. I also use CouchDB. Seems like a natural fit, but the SlackBuild on SlackBuilds.org is stuck at 0.11.

That’s okay; it is a good script and works well with the latest release. However, I don’t want to run the latest release of CouchDB: I want to run 1.2.x from the git repository, because I really like the new replication engine for my work.

So, I had to do some tinkering with the SlackBuild script. Continue reading

Using the replicator database in CouchDB 1.1.0

I am testing out the new replicator db features in CouchDB 1.1 (documented here), and I came across a quirk that took me a while to figure out, so I thought I’d write it up. It isn’t a bug, and it is totally consistent with the rules, but for some reason it was counter-intuitive to me.

The fundamental problem is that I am using slashes in database names. This is fine and supported, but when used in URLs the slashes have to be escaped.

The database I am replicating between machines is called vdsdata/d12/2007. Ordinarily in CouchDB, because it uses HTTP for everything, I’d have to escape that as “vdsdata%2fd12%2f2007”. For example, if I want to get the status of the database, I’d write

curl 127.0.0.1:5984/vdsdata%2Fd12%2F2007

which will return:

{"db_name":"vdsdata/d12/2007",
 "doc_count":109,
 "doc_del_count":3,
 "update_seq":151,
 "purge_seq":0,
 "compact_running":false,
 "disk_size":479323,
 "instance_start_time":"1308159829266270",
 "disk_format_version":5,
 "committed_update_seq":151}

So this habit of always escaping the slashes is ingrained in me, and I always call a URL escape routine in my programs to escape the database names. For example, in the R code I am working on I just call tolower(paste(components,collapse='%2F')).

However, this doesn’t work in the replicator database. As documented, the replicator database entries are of the format:

{
    "_id": "doc_bar_pull",
    "source":  "http://myserver.com:5984/foo",
    "target":  "bar"
}

or going the other way

{
    "_id": "doc_bar_push",
    "source":  "bar",
    "target":  "http://myserver.com:5984/foo"
}

The docs don’t mention the odd use case of putting slashes in the database names, so I just continued to call my escaping routines and created the following replicator database entry:

{
    "_id":"pull_broken",
    "source":"http://example.com:5984/vdsdata%2fd12%2f2007",
    "target":"vdsdata%2fd12%2f2007",
    "continuous":true
}

This spews illegal database name errors in the log files, and if you create the replicator document via Futon, once you save it you’ll immediately see errors in the document, as in:

{
    "_id":"pull_broken",
    "_rev":"2-3546101f93d73f8f7d3a185569b036d3",
    "continuous":true,
    "source":"http://example.com:5984/vdsdata%2fd12%2f2007",
    "target":"vdsdata%2fd12%2f2007",
    "_replication_state":"error",
    "_replication_state_time":"2011-06-15T11:24:06-07:00",
    "_replication_id":"3540d1d63ad94edd9f0731928ebaf2b1"
}

What is going on is that internally CouchDB is not using HTTP to access its databases, and CouchDB knows that its databases are named with slashes or other funny characters. So when I escape the database name in the replicator document, CouchDB happily does what I asked and looks for a database with “%2F” in its name. Instead, my entry in the replicator database must use the literal slashes for the local db, even though it still must use the escaped form for the remote db, since that remote database is accessed over HTTP. The correct entry looks something like:

curl 127.0.0.1:5984/_replicator/vdsdata%2Fd12%2F2007_pull
{
    "_id":"vdsdata/d12/2007_pull",
    "_rev":"18-878ae6f27325ca11a1339d6bd1f68c39",
    "source":"http://example.com:5984/vdsdata%2fd12%2f2007",
    "target":"vdsdata/d12/2007",
    "continuous":true,
    "_replication_state":"triggered",
    "_replication_state_time":"2011-06-15T10:40:29-07:00",
    "_replication_id":"85b8199b41199783ef25048ca8913dad"
}
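Since I build these documents from R anyway, the rule is easy to encode: escape the remote side, leave the local side alone. Something like this sketch (illustrative, not my actual code):

## sketch: escape only the remote end of a pull replication
components  <- c("vdsdata", "d12", "2007")
local.name  <- tolower(paste(components, collapse = "/"))    ## local db keeps its slashes
remote.name <- tolower(paste(components, collapse = "%2F"))  ## remote db goes over HTTP, so escape
repdoc <- list(
  "_id"      = paste(local.name, "pull", sep = "_"),
  source     = paste("http://example.com:5984", remote.name, sep = "/"),
  target     = local.name,
  continuous = TRUE
)
## toJSON(repdoc) is what gets PUT into the local _replicator database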

Now I’ve sorted that out, time to actually use replicating databases in my work!

CouchDB changes feed

Quick post to note that the “Asynchronous architectures with the CouchDB _changes feed” webcast hosted by O’Reilly Media was good. Unfortunately, what I really want to do is not yet possible out of the box. I want to use the GeoCouch indexer to filter the changes feed. So, for example (liberally inventing a ‘&&’ operator that mimics the bounding-box overlap operator from PostGIS):


function(doc, req) {
  // pretend the '&&' here does a spatial overlap test, like the PostGIS operator
  if (doc.geometry && req.query.bbox) {
    return true;
  }
  return false;
}

Then a map application could update, say, traffic conditions for the current viewport, rather than sifting through possibly millions of traffic events every few minutes just to find the handful in my area.

The hooks are there, and it might be possible to do a workaround, but I’m not seeing it yet.

I also thought I wanted reduce in the filter, but thinking about it I’ve changed my mind. Taking the Focus app as an example, suppose I want to see who has been active recently, not the actual posted tasks. A reduce in the filter could bundle up just the people making changes, while skipping the actual changes. But I think I understand enough of the underlying architecture of CouchDB to know that there probably isn’t much advantage to computing map/reduce b-trees for changes…by definition these things are changing. If such a thing were really important, one could use the changes feed to populate another db with appropriate documents (and the change timestamp and ordering), which in turn would have whatever map/reduce stuff I need to run.
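In case it is not clear what I mean, here is a rough R sketch of that last idea (the database names and the keep() test are invented for the example):

library(RCurl)
library(RJSONIO)

## rough sketch: poll the changes feed and copy the interesting documents,
## plus their change ordering, into a second database that can carry views
couch <- "http://127.0.0.1:5984"
copy.changes <- function(since = 0){
  uri  <- paste(couch, "/tasks/_changes?include_docs=true&since=", since, sep = "")
  feed <- fromJSON(getURL(uri))
  for(row in feed$results){
    doc <- row$doc
    if(keep(doc)){                      ## keep() is whatever "recently active" test I need
      doc$change_seq <- row$seq         ## preserve the ordering of the change
      doc[["_rev"]]  <- NULL            ## let the target db assign its own revisions
      getURL(paste(couch, "task_activity", URLencode(doc[["_id"]], reserved = TRUE), sep = "/"),
             customrequest = "PUT",
             httpheader = c("Content-Type" = "application/json"),
             postfields = toJSON(doc))
    }
  }
  feed$last_seq                         ## feed this back in as "since" next time
}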

Millions of databases

One of the problems I’ve been having with CouchDB is the large number of documents I need to store. My solution was to split up the databases along the time dimension and along districts, so that I have one database per (month * year * district). That makes things manageable for my head, but it is less than trivial for a program to sort out.
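To give a flavor of the scheme, the database names work out to something like this (the helper is illustrative, not my production code):

## one database per district, year, and month
shard.name <- function(district, year, month){
  tolower(paste("vdsdata", sprintf("d%02d", district),
                sprintf("%04d-%02d", year, month), sep = "/"))
}
shard.name(12, 2007, 2)   ## "vdsdata/d12/2007-02"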

On the other hand, most of my uses involve putting back together those 12 months in the year. So that suggests that this sharding approach is a bad one. I played around with merging up 12 months into one database, but because the view trees aren’t merged but rather have to be rebuilt, that didn’t make any sense at all, as I posted last week. Using node.js to pre-compute the views that aren’t generated yet works and is lots of fun, but not something I want to maintain beyond an interesting test of node.js.

Then today, while browsing the CouchDB users mailing list, I saw that J Chris Anderson had posted that “CouchDB has been tested with millions of databases on a single server, no problem, so this model is practical and supported.”

Click went my low-wattage brain, and I realized that I could generate one database per sensor, rather than merging all the sensors into a single database. There are only a few thousand detectors at most (far fewer than millions), and this would reduce the size of each database enough that more complicated views would become bearable. Even if I combine sensors into related groups (mainline, ramp, HOV, etc.), that would still produce much smaller databases.

And the fit is better with our usage patterns. We often want to drill down or produce aggregates for a particular sensor location or even a particular lane, but it is only very rarely that we want to compare data across sensors. And when we do, it is usually something best handled in R rather than JavaScript.

RJSONIO to process CouchDB output

I have an idea. I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R via RCurl and RJSONIO (from http://www.omegahat.org/). So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, then use RJSONIO to parse the JSON, then use bootstrap methods to estimate the expected value and confidence bounds, and, perhaps more importantly, try to estimate outliers and unusual events. Continue reading
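Roughly, the plan looks like this (a sketch only; the database names, the view, and the detector id are all made up for illustration):

library(RCurl)
library(RJSONIO)

## pull one detector's 5-minute volumes out of each monthly database...
months <- sprintf("vdsdata%%2Fd12%%2F2007-%02d", 1:12)
volumes <- unlist(lapply(months, function(db){
  uri  <- paste("http://127.0.0.1:5984", db,
                "_design/detector/_view/volume?key=%221213686%22", sep = "/")
  rows <- fromJSON(getURL(uri))$rows
  sapply(rows, function(r) r$value)
}))
## ...then bootstrap the mean to get an expected value and confidence bounds
boot.means <- replicate(1000, mean(sample(volumes, replace = TRUE), na.rm = TRUE))
quantile(boot.means, c(0.025, 0.5, 0.975))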