I have a love-hate relationship with R. R is extremely powerful and lots of fun when it works, but so often I spend hours at a time wondering what is going on (to put my irritation in printable prose).
Today I finally figured out a nagging problem. I am pulling data from CouchDB into R using the excellent RJSONIO and RCurl libraries. JSON has a strict requirement that unknown values are written as null, while R has a more nuanced concept that includes NA as well as NULL. My original use of the RJSONIO library to save data to CouchDB had to account for this fact by using a regular expression to convert NA to proper JSON null values. (I think the latest version of RJSONIO might actually handle this better, but I haven't checked, as my current code works fine since the regex is conditional.)
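The save-side fix boils down to something like this (a minimal sketch: the \bNA\b regex is a stand-in for my conditional version, and a word-boundary match this blunt would also hit "NA" inside string values):
library(RJSONIO)
x <- c(1, NA, 3)
json <- toJSON(x)
## if toJSON leaves a bare NA in the output, patch it into valid JSON null
json <- gsub("\\bNA\\b", "null", json, perl=TRUE)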
Now coming the other way, from CouchDB into R, RJSONIO's fromJSON() function will happily convert JSON null values into R NULL values. My little couch.get() function looks like this:
couch.get <- function(db, docname, local=TRUE, h=getCurlHandle()){
    ## db may arrive as a vector of name parts; collapse it into one name
    if(length(db) > 1){
        db <- couch.makedbname(db)
    }
    ## couchdb and localcouchdb hold the server URL prefixes (defined elsewhere)
    uri <- paste(couchdb, db, docname, sep="/")
    if(local) uri <- paste(localcouchdb, db, docname, sep="/")
    ## hack to url encode spaces
    uri <- gsub("\\s", "%20", x=uri, perl=TRUE)
    ## fetch the document and parse the JSON in one shot
    fromJSON(getURL(uri, curl=h)[[1]])
}
The key line is the last one, where the result of RCurl's getURL() function is passed directly to RJSONIO's fromJSON() and then returned to the caller.
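Calling it looks like this (the database name is hypothetical; the document id is the one from the example below):
library(RCurl)
library(RJSONIO)
## assumes the couchdb/localcouchdb URL variables are already set
doc <- couch.get("sensors", "1213686 2007-02-28 10:38:30")
str(doc$data, max.level=1)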
In my database, to save space, each document holds a day's worth of data as a list of lists:
{
    "_id": "1213686 2007-02-28 10:38:30",
    "_rev": "1-c8f0463d1910cf4e89370ece6ef250e2",
    "data": {
        "nl1": [9, 12, 12, ... ],
        "nr1": [ ... ],
        ...
        "ts": [ ... ]
    }
}
Every entry in the ts list has a corresponding entry in every other array in the data object, but that entry could be null. This makes it easy to plot the data against time (using d3, but that is another post) or to reload it back into R with a timestamp.
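A tiny example shows the NULL coming through (this is the behavior of the RJSONIO version I am running; newer versions may simplify differently):
library(RJSONIO)
x <- fromJSON('[9, null, 12]')
length(x)  # 3: it comes back as a list, since a numeric vector cannot hold NULL
x[[2]]     # NULL, not NA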
But loading data into R isn’t quite the one-liner I was expecting, because of how R handles NULL compared to NA. My first and incorrect attempt was:
alldata <- doc$data
colnames <- names(alldata)
## deal with the non-ts variables first
varnames <- grep(pattern="^ts$", x=colnames, perl=TRUE, invert=TRUE, value=TRUE)
## keep only what I am interested in
varnames <- grep(pattern="^[no][lr]\\d+$", x=varnames, invert=TRUE, perl=TRUE, value=TRUE)
data.matrix <- matrix(unlist(alldata[varnames]), byrow=FALSE, ncol=length(varnames))
First I grab just the data object, pull off the variables of interest, and then make a matrix out of the data.
The problem is the recursive application of unlist() buried in the matrix() call. The alldata object is really a list of lists, and some of those lists have NULL values, so the recursive unlist() SILENTLY drops the NULL values. (So IRRITATING!)
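The behavior is easy to see in isolation:
unlist(list(1, NULL, 3))
## [1] 1 3        two elements, not three: the NULL is simply gone
unlist(list(1, NA, 3))
## [1]  1 NA  3   NA survives, so positions stay aligned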
Instead, what you have to do is carefully replace every numeric NULL value with what R wants: NA. (And this is where learning how to do all that callback programming in JavaScript comes in handy, as I define a callback function for the lapply() call inline and don't get worked up about it anymore.)
## first, make NULL into NA
intermediate <- lapply(alldata[varnames], function(l){
    nullmask <- unlist(lapply(l, is.null))
    l[nullmask] <- NA
    l
})
## then do the unlisting
data.matrix <- matrix(unlist(intermediate), byrow=FALSE, ncol=length(varnames))
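Here is the difference in miniature, with made-up two-variable data standing in for alldata:
toy <- list(a=list(1, NULL, 3), b=list(4, 5, 6))
## naive unlist: the NULL vanishes, leaving 5 values to fill a 3x2 matrix
matrix(unlist(toy), byrow=FALSE, ncol=2)
## after the NULL -> NA pass, all 6 values line up correctly
fixed <- lapply(toy, function(l){ l[unlist(lapply(l, is.null))] <- NA; l })
matrix(unlist(fixed), byrow=FALSE, ncol=2)
##      [,1] [,2]
## [1,]    1    4
## [2,]   NA    5
## [3,]    3    6
In the toy case matrix() at least warns about the recycling; with my real data, whenever the shortfall happened to be a multiple of the number of columns there was no warning at all, which is exactly the aliasing problem described next.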
Most of the time the simple way worked fine, but it required special handling when I slapped the timestamp column back onto my data. What I ended up doing (back when I was just hacking code that worked (TM)) was dropping any timestamp for which every variable I was interested in was NULL. And yes, the logic was as tortured as the syntax of that sentence.
But every once in a while the data would be out of sync, because sometimes the variables I was extracting had different numbers of NULL values (for example, the mean would be fine, but one of the correlation coefficients would be undefined). In those cases the loop would either work and be wrong (if the count of missing values was perfectly aliased with the length of varnames), or else it would crash and get noted by my error handler.
With the new explicit loop to convert NULL to NA, the loading function works fine, with no more try-error objects returned from my try() call. And even better, I no longer have to lie awake at night wondering whether some data was perfectly aliased with the missing values and slipped through.