Using the replicator database in CouchDB 1.1.0

I am testing out the new replicator db features in CouchDB 1.1 (documented here), and I came across a quirk that took me a while to figure out, so I thought I’d write it up. It isn’t a bug, and it is totally consistent with the rules, but for some reason it was counter-intuitive to me.

The fundamental problem is that I am using slashes in database names. This is fine and supported, but when used in URLs the slashes have to be escaped.

The database I am replicating between machines is called vdsdata/d12/2007. Ordinarily in CouchDB, because it uses HTTP for everything, I’d have to escape that as
“vdsdata%2fd12%2f2007”. For example, if I want to get the status of the database, I’d write

curl 127.0.0.1:5984/vdsdata%2Fd12%2F2007

which will return:

{"db_name":"vdsdata/d12/2007",
 "doc_count":109,
 "doc_del_count":3,
 "update_seq":151,
 "purge_seq":0,
 "compact_running":false,
 "disk_size":479323,
 "instance_start_time":"1308159829266270",
 "disk_format_version":5,
 "committed_update_seq":151}

So this habit of always escaping the slashes is ingrained in me, and I always call a URL escape routine in my programs to escape the database names. For example, in the R code I am working on I just call tolower(paste(components,collapse='%2F')).
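
For concreteness, here is that escaping habit as a tiny helper (the name escape.dbname is just for illustration; components holds the pieces of the database name):

escape.dbname <- function(components){
  ## join the name components with an escaped slash, lower-cased
  tolower(paste(components,collapse='%2F'))
}
escape.dbname(c("vdsdata","d12","2007"))
## [1] "vdsdata%2fd12%2f2007"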

However, this doesn’t work in the replicator database. As documented, the replicator database entries are of the format:

{
    "_id": "doc_bar_pull",
    "source":  "http://myserver.com:5984/foo",
    "target":  "bar"
}

or going the other way

{
    "_id": "doc_bar_push",
    "source":  "bar",
    "target":  "http://myserver.com:5984/foo"
}
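
These are ordinary documents in the _replicator database, so one way to create them is a plain HTTP PUT. A rough sketch using RCurl (the same library I use later for bulk uploads), for the pull example above:

library(RCurl)
## sketch: PUT the pull document into the _replicator database
repdoc <- '{"source": "http://myserver.com:5984/foo", "target": "bar"}'
curlPerform(
            url = "http://127.0.0.1:5984/_replicator/doc_bar_pull"
            ,httpheader = c('Content-Type'='application/json')
            ,customrequest = "PUT"
            ,postfields = repdoc
            )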

The docs don’t mention the odd use case of putting slashes in the database names, so I just continued to call my escaping routines and created the following replicator database entry:

{
    "_id":"pull_broken",
    "source":"http://example.com:5984/vdsdata%2fd12%2f2007",
    "target":"vdsdata%2fd12%2f2007",
    "continuous":true
}

This spews illegal database name errors in the log files, and if you create the replicator document via Futon, you’ll immediately see the error state in the document once you save it, as in:

{
    "_id":"pull_broken",
    "_rev":"2-3546101f93d73f8f7d3a185569b036d3",
    "continuous":true,
    "source":"http://example.com:5984/vdsdata%2fd12%2f2007",
    "target":"vdsdata%2fd12%2f2007",
    "_replication_state":"error",
    "_replication_state_time":"2011-06-15T11:24:06-07:00",
    "_replication_id":"3540d1d63ad94edd9f0731928ebaf2b1"
}

What is going on is that internally CouchDB does not use HTTP to access its own databases, and it knows that its databases can be named with slashes and other funny characters. So when I escape the database name in the replicator document, CouchDB happily does what I asked and looks for a database with a literal “%2f” in its name. Instead, my entry in the replicator database must keep the slashes for the local db, even though it still must have the escape for the remote db, since that remote database is accessed over HTTP. The correct entry looks something like:

curl 127.0.0.1:5984/_replicator/vdsdata%2Fd12%2F2007_pull
{
    "_id":"vdsdata/d12/2007_pull",
    "_rev":"18-878ae6f27325ca11a1339d6bd1f68c39",
    "source":"http://example.com:5984/vdsdata%2fd12%2f2007",
    "target":"vdsdata/d12/2007",
    "continuous":true,
    "_replication_state":"triggered",
    "_replication_state_time":"2011-06-15T10:40:29-07:00",
    "_replication_id":"85b8199b41199783ef25048ca8913dad"
}
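
In other words, the document id still gets escaped when it appears in the URL, but inside the document the local target keeps its literal slashes while the remote source stays escaped. A sketch of creating that document with RCurl (just one way to do it):

library(RCurl)
## escaped in the URL path, literal slashes in the local target,
## escaped in the remote source URL
repdoc <- '{"_id": "vdsdata/d12/2007_pull", "source": "http://example.com:5984/vdsdata%2fd12%2f2007", "target": "vdsdata/d12/2007", "continuous": true}'
curlPerform(
            url = "http://127.0.0.1:5984/_replicator/vdsdata%2Fd12%2F2007_pull"
            ,httpheader = c('Content-Type'='application/json')
            ,customrequest = "PUT"
            ,postfields = repdoc
            )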

Now that I’ve sorted that out, it’s time to actually use replicating databases in my work!

RJSONIO doesn’t handle NA, but gsub does (and other tales from the land of R)


RJSONIO is great. I’m so glad it exists and I don’t have to write it.

That said, I found and worked around two bugs in it today. First, my use case: I am saving one document per row of a data frame into CouchDB, so I need to convert each row with toJSON. But if you call it with

docs <- apply(data,1,toJSON)

it is going to break and write out everything as a string. That is, a number such as 0.123 will be written in the database as “0.123” with the quotes. Not so bad for basic numbers, but that gets irritating to handle with exponentials and so on. So instead I had to call it in a loop, once for each row.
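
A tiny illustration of the problem, with a made-up two-row data frame rather than my real detector data:

library(RJSONIO)
data <- data.frame(speed=c(0.123,61.5), name=c("d12","d12"), stringsAsFactors=FALSE)
docs <- apply(data,1,toJSON)
## every value in each document, numbers included, comes out as a quoted JSON string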

Second, once I fixed the above bug, I found out the hard way that NA is coded as…NA, not as null, even though NA is not valid JSON. CouchDB complained by puking on my entire bulk upload, and it took a while to find the problem.

A regex fix with gsub worked well, but I also realized that I can just drop the NA values altogether.

Also, because I am pushing up lots and lots of records, using the default basic reader chewed up a lot of RAM. Instead I hacked the original to make a “null” reader that saves nothing at all.
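
Something along these lines works as a “null” gatherer; this is a minimal sketch that mimics the interface of RCurl’s basicTextGatherer() (an update function to hand to writefunction, plus value and reset), not my exact hack:

library(RCurl)
nullTextGatherer <- function(){
  update <- function(txt){
    ## report the chunk as handled (the libcurl write callback convention), but keep nothing
    nchar(txt, type="bytes")
  }
  list(update=update,
       value=function() character(0),
       reset=function() invisible(NULL))
}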

My final processing loop is as follows:

    uri <- paste(couchdb,db,'_bulk_docs',sep="/")
    reader <- nullTextGatherer()
        ... 
    ## build one JSON document per row, dropping the NA values as we go
    jsondocs <- list()
    for( row in 1:nrow(chunk) ){
      keepcols <- !is.na(chunk[row,])
      jsondocs[[row]] <- toJSON(chunk[row,keepcols])
    }
    bulkdocs <- paste('{"docs":[',paste(jsondocs, collapse=','),']}',sep='')
    ## fix JSON:  too many spaces, NA handled wrong
    gbdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
    ## this next is not needed now that I am stripping NA entries above, but better safe
    ## gbdocs <- gsub(" NA"," null"  ,x=gbdocs  ,perl=TRUE)
    curlPerform(
                url = uri
                ,httpheader = c('Content-Type'='application/json')
                ,customrequest = "POST"
                ,postfields = gbdocs
                ,writefunction = reader$update
                )

where the variable uri is the usual CouchDB endpoint for bulk uploads.

Update

Idiot!

Perhaps this is why I blog.  I don’t do pair programming or whatever, but here I am, sticking my code out on the internet with horrendously stupid errors!

Of course, I don’t need my toJSON loop!  All I need to do is create a structure and dump it that way.  I ignored my first rule of R programming:  loops are bad.

So, the loop avoidance amounts to letting the much smarter person who programmed RJSONIO do the loop unrolling. Instead of the apply() stupidness, and instead of the explicit loop, all I needed to do was create a list with a single element “docs” equal to the data.frame I wanted to store. In short, I recoded the above loop as

  ## simple structure for toJSON to unroll the loop internally
  jsondocs <- list('docs'=chunk)
  bulkdocs = toJSON(jsondocs,collapse='')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)

I loaded some test data (the month of December for a single detector), reeled off 50,000 rows, and passed them to two routines: one the old way, one the new way.

With the old way, I got tired of waiting and killed the timer. R’s system.time() command reported:

   Timing stopped at: 421.402 1.093 422.56 

The new way runs much faster. system.time() reports:

> system.time(docstring<-jsondump2(chunk))
   user  system elapsed 
 31.658   0.157  31.843 

That is at least a 10x speed-up over the loopy way, probably more, but I can’t be bothered finding out exactly how bad that loop was.

Update

Idiot!

Wrong again! The problem with the above code is that I didn’t inspect the generated JSON. What toJSON is doing is evaluating the data.frame as a list; that is, instead of row-wise processing, toJSON is processing each column in order. That is because a data.frame is really a list of columns, each column a vector of the same length. So although that approach is fast, it doesn’t work.
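
A quick way to see this, again with a toy data frame:

library(RJSONIO)
df <- data.frame(a=1:2, b=c("x","y"), stringsAsFactors=FALSE)
cat(toJSON(df))
## the output is organized by column, roughly {"a": [1, 2], "b": ["x", "y"]},
## not one object per row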

Which led me to the reason why toJSON seems to have a bug when applied using apply(…): apply coerces its argument into a matrix first, so a data.frame with mixed character and numeric columns gets converted into a matrix of character strings.
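
The coercion is easy to demonstrate with base R alone:

df <- data.frame(speed=0.123, name="d12", stringsAsFactors=FALSE)
m <- as.matrix(df)   ## this is what apply() does internally
mode(m)              ## "character" -- every cell is now a string
m[1,"speed"]         ## the number has become a quoted string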

I took a stab at using the plyr library, but that was slower. I then took a look at the foreach library, but that was slower still.

My new champ is once again my old, super slow loop!

But just because I made a mistake doesn’t mean I shouldn’t persist. “Loops are bad” is an important rule when R code is slow, and my code is taking way too long to write out data, so that loop has got to go.

Realizing what was behind the “bug” with apply(*,1,toJSON) finally gave me the solution. What I had to do was split the data into numeric and text columns, separately apply toJSON, and then recombine the result.

  numeric.cols <- 1:35
  text.cols <- 36:37
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
  paste(num.data,text.data,collapse=',')

A few more problems presented themselves. First, each run of toJSON() produces an object, wrapped in curly braces. So the call to paste(), while it correctly combines the JSON strings row-wise, has buried inside of each “row” the invalid combination of “} {” as the numeric “object” ends and the text “object” begins. With a little regular expression glue, the correct paste line becomes:

 gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)

The final “fastest” function is:

numeric.cols <- 1:35
text.cols <- 36:37
jsondump4 <- function(chunk){
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
  bulkdocs <- gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)
  bulkdocs <- paste('{"docs":[',bulkdocs,']}')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)
  bulkdocs
}
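
The output of jsondump4() then plugs straight into the bulk upload call from the first version (couchdb and db hold the server URL and the escaped database name, and reader is the null gatherer from earlier):

bulkdocs <- jsondump4(chunk)
uri <- paste(couchdb,db,'_bulk_docs',sep="/")
reader <- nullTextGatherer()
curlPerform(
            url = uri
            ,httpheader = c('Content-Type'='application/json')
            ,customrequest = "POST"
            ,postfields = bulkdocs
            ,writefunction = reader$update
            )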

On 5,000 rows, the timings are:


> chunk <- docdf[1:5000,]
> system.time(bulkdocs<-jsondump1(chunk)) ## original loop
   user  system elapsed 
 40.798   0.001  40.823 
> system.time(bulkdocs<-jsondump2(chunk)) ## ddply from plyr library
   user  system elapsed 
 64.620   0.002  64.655 
> system.time(bulkdocs<-jsondump3(chunk)) ## foreach from foreach library
   user  system elapsed 
130.148   0.007 130.230 
> system.time(bulkdocs<-jsondump4(chunk)) ## call apply separately for numeric, text
   user  system elapsed 
  5.737   0.000   5.740 

and on 50,000 rows:


> chunk <- docdf[1:50000,]
> system.time(bulkdocs<-jsondump4(chunk))
   user  system elapsed 
 63.856   0.155  64.044 

And once again, my basic rule for struggling with R proves true. Loops may not always be bad, but if you have a loop and you have slow code, it is worth figuring out how to get rid of that loop.