When R and JSON fight

I have a love-hate relationship with R. R is extremely powerful and lots of fun when it works, but all too often I spend hours at a time wondering what on earth is going on (to put my irritation in printable prose).

Today I finally figured out a nagging problem. I am pulling data from CouchDB into R using the excellent RJSONIO and RCurl libraries. JSON has a strict requirement that unknown values are written as null, while R has a more nuanced concept that includes NA as well as NULL. My original use of the RJSONIO library to save data to CouchDB had to account for this fact by using a regular expression to convert NA into proper JSON null values. (I think the latest version of RJSONIO might actually handle this better, but I haven't checked, as my current code works fine since the regex is conditional.)
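The fix itself is tiny. Here is a minimal sketch of that conditional-regex approach (couch.toJSON() is a hypothetical wrapper, and whether raw toJSON() output actually contains bare NA tokens depends on your RJSONIO version):

library(RJSONIO)

## hypothetical wrapper: some versions of RJSONIO emit bare NA tokens,
## which are not legal JSON, so patch them only when they show up
couch.toJSON <- function(x){
  json <- toJSON(x)
  if(grepl("\\bNA\\b", json, perl=TRUE)){
    ## caveat: this would also clobber a literal "NA" string inside the
    ## data, so it is only safe for numeric payloads like mine
    json <- gsub("\\bNA\\b", "null", json, perl=TRUE)
  }
  json
}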

Now coming the other way, from CouchDB into R, RJSONIO's fromJSON() function will happily convert JSON null values into R NULL values. My little couch.get() function looks like this:

couch.get <- function(db, docname, local=TRUE, h=getCurlHandle()){
  ## a db passed as a vector of name parts gets collapsed into one name
  if(length(db) > 1){
    db <- couch.makedbname(db)
  }
  uri <- paste(couchdb, db, docname, sep="/")
  if(local) uri <- paste(localcouchdb, db, docname, sep="/")
  ## hack to url encode spaces in document ids
  uri <- gsub("\\s", "%20", x=uri, perl=TRUE)
  ## getURL does the HTTP GET; hand the response body straight to fromJSON
  fromJSON(getURL(uri, curl=h)[[1]])
}

The key line is the last one, where the results of RCurl’s getURL() function are passed directly to RJSONIO’s fromJSON() and then returned to the caller.
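Calling it is a one-liner (the database name here is hypothetical; note the spaces in the document id, which are exactly why the %20 hack exists):

doc <- couch.get("mydb", "1213686 2007-02-28 10:38:30")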

In my database, to save space, each document holds one day of data as a list of lists:

{
   "_id": "1213686 2007-02-28 10:38:30",
   "_rev": "1-c8f0463d1910cf4e89370ece6ef250e2",
   "data": {
       "nl1": [9,12,12, ... ],
       "nr1": [ ... ],
       ...
       "ts" : [ ... ]
   }
}

Every entry in the ts list has a corresponding entry in every other array in the data object, but that entry could be null. This makes it easy to plot the data against time (using d3, but that is another post) or to reload it back into R with a timestamp.
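Here is roughly what fromJSON() hands back for one of those arrays (the exact simplification behavior varies a little between RJSONIO versions):

library(RJSONIO)
x <- fromJSON('{"nl1":[9,null,12],"ts":[1,2,3]}')
x$nl1
## a list of length 3 whose second element is NULL, rather than a
## numeric vector with an NA in the middle
x$ts
## [1] 1 2 3  -- no nulls here, so this simplifies to a plain vector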

But loading data into R isn’t quite the one-liner I was expecting, because of how R handles NULL compared to NA. My first and incorrect attempt was:

alldata <- doc$data
colnames <- names(alldata)
## deal with the non-ts variables first
varnames <- grep(pattern="^ts$", x=colnames, perl=TRUE, invert=TRUE, value=TRUE)
## keep only the variables I am interested in
varnames <- grep(pattern="^[no][lr]\\d+$", x=varnames, invert=TRUE, perl=TRUE, value=TRUE)
data.matrix <- matrix(unlist(alldata[varnames]), byrow=FALSE, ncol=length(varnames))

First I grab just the data object, pull off the variables of interest, and then make a matrix out of the data.

The problem is the recursive application of unlist buried in the matrix command. The alldata object is really a list of lists, and some of those lists have NULL values, so the recursive application of unlist SILENTLY drops the NULL values (so IRRITATING!).
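You can watch the data loss happen in one line:

unlist(list(9, NULL, 12))
## [1]  9 12  -- the NULL is silently dropped, so the result is too short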

Instead, what you have to do is carefully replace every numeric NULL value with what R wants: NA. (And this is where learning how to do all that callback programming in JavaScript comes in handy, as I define a callback function for the lapply call inline and don't get worked up about it anymore.)

  ## first, make NULL into NA
  intermediate <- lapply(alldata[varnames],function(l){
    nullmask <- unlist(lapply(l, is.null))
    l[nullmask] <- NA
    l
  })
  ## then do the unlisting
  data.matrix <- matrix(unlist(intermediate),byrow=FALSE,ncol=length(varnames))
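A cheap sanity check (not in my original code, but the kind of guard I should have had all along) is to confirm the matrix still has one row per timestamp:

  ## every remaining variable should contribute one value per timestamp
  stopifnot(nrow(data.matrix) == length(unlist(alldata$ts)))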

Most of the time the simple way worked fine, but it required special handling when I slapped the timeseries column back onto my data. What I ended up having to do (back when I was just hacking code that worked (TM)) was to drop any timestamp for which all of the variables I was interested in were NULL. And yes, the logic was as tortured as the syntax of that sentence.

But every once in a while the data would get out of sync, because the variables I was extracting sometimes held different numbers of NULL values (for example, the mean would be fine, but one of the correlation coefficients would be undefined). In those cases the loop would either work and be wrong (if the odd number of NULLs was perfectly aliased with the length of varnames), or else it would crash and get noted by my error handler.

With the new explicit loop converting NULL to NA, the loading function works fine, with no more try-errors returned from my try() call. And even better, I no longer have to lie awake nights wondering whether some data slipped through because its missing values were perfectly aliased.


4 thoughts on “When R and JSON fight”

  1. Very interesting post. I am trying to start a project using an existing couchdb dataset and R. Is there documentation you've read that helped get you started? What is really tripping me up is how to work with the keys. The way my data is set up, it would be as if there were one key/tag pair for nl1, nr1, and ts under the field "data" from your example. More explicitly, {"data":{"nl1":[2], "nr1":[3], "ts":[4]}}. I want to be able to import nl1, nr1, and ts from the data key as vectors like below:
         nl1  nr1  ts
    doc1   2    3   4
    doc2   .    .   .

    Is this similar to how your data is set up? Any help you can give me or documentation you can point me toward would be much appreciated. After I'm done with this project, I plan on putting up some tutorials myself, because finding information has been difficult, but I have to figure out how to get R to do what I want first!

  2. I am far from an expert on R, but I find it is best to think about it in terms of lists. Also, I tend towards lists of lists for data in CouchDB because my datasets are large (hundreds of GB scale), and all those repeated labels in a JSON object such as {"a":100,"b":200} tend to add a lot to the total db size when you have tens of millions of documents.

    For smaller projects, using objects with labels makes sense, but it does make the data slightly more difficult to push into R.

    What you need to do is process all of your docs and extract the values into distinct lists. There are lots of ways to do this. I tend towards using lapply and sapply myself, but there is also a pretty good library called plyr that you might want to check out. I use a call to a view or to _all_docs to get view output rows or all the documents between a start key and an end key. The important point is to get lots of documents at once to work on. Then, after extracting the Couch stuff from the JSON transport into R, I process the resulting lists somehow… how exactly depends on the format of the CouchDB document. Usually there is some mix of lapply or sapply, the occasional unlist, and liberal use of inside knowledge about how the CouchDB data is set up.

    I know that isn't helpful, but I think I need to write a full post to explain how I process CouchDB docs into R… unless you beat me to it.

  3. Thanks for getting back to me. If you have time to write up a post, please do! If I do manage to figure it out, I probably won’t put up a post for another couple months after my research is done. Good luck to you.

  4. I had the same problem you had and stumbled across this page. If anyone else faces the same challenge, the RJSONIO package's fromJSON() function can now take the argument nullValue = NA, which solves so many problems!
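    For example, a quick sketch of what that looks like:

    library(RJSONIO)
    fromJSON('{"nl1":[9,null,12]}', nullValue = NA)
    ## the nulls come back as NA, so nl1 keeps all three of its slots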
