RJSONIO doesn’t handle NA, but gsub does (and other tales from the land of R)

RJSONIO is great. I’m so glad it exists and I don’t have to write it.

That said, I found and worked around two bugs in it today. First, my use case: I am saving one document per row of a data frame into CouchDB, so I need to convert each row with toJSON. But if you call it with

docs <- apply(data,1,toJSON)

it is going to break and write out everything as a string. That is, a number such as 0.123 will be written in the database as “0.123” with the quotes. Not so bad for basic numbers, but that gets irritating to handle with exponentials and so on. So instead I had to call it in a loop, once for each row.
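To make the symptom concrete, here is a rough sketch on a toy data frame (the column names are made up; the real data is detector readings), showing the numbers coming back quoted:

    library(RJSONIO)

    ## toy stand-in for one chunk of detector data (made-up columns)
    df <- data.frame(occupancy = c(0.123, 0.456),
                     lane = c("left", "right"),
                     stringsAsFactors = FALSE)

    apply(df, 1, toJSON)
    ## each element comes back with the numbers quoted, roughly:
    ##   {"occupancy": "0.123", "lane": "left"}
    ## instead of the unquoted 0.123 I actually want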

Second, I found out the hard way once I fixed the above bug that NA is coded as…NA, not as null, even though NA is not valid JSON. CouchDB complained by puking on my entire bulk upload, and it took a while to find the problem.

Regex worked well, but I also realized that I can just drop the NA values altogether.
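For the record, the regex is nothing fancy (crude, but good enough for my data); on a made-up JSON string it looks like this, and the full version appears in the processing loop below:

    ## swap bare NA tokens for JSON null before POSTing (example string is made up)
    json <- '{"speed": NA, "lane": "left"}'
    gsub(" NA", " null", x = json, perl = TRUE)
    ## [1] "{\"speed\": null, \"lane\": \"left\"}"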

Also, because I am pushing up lots and lots of records, using the default basic reader chewed up a lot of RAM. Instead I hacked the original to make a “null” reader that saves nothing at all.
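Roughly, the hack looks like this; a sketch from memory, modeled on RCurl's basicTextGatherer(), which hands back an update/value/reset triple (the byte-count return from update is just defensive):

    library(RCurl)

    ## a "null" text gatherer: same shape as basicTextGatherer(), but update()
    ## throws the response text away instead of accumulating it in memory
    nullTextGatherer <- function() {
      update <- function(txt) {
        ## do nothing with the text; report the bytes as handled
        nchar(txt, type = "bytes")
      }
      list(update = update,
           value  = function(...) "",
           reset  = function() invisible(NULL))
    }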

My final processing loop is as follows:

    uri <- paste(couchdb,db,'_bulk_docs',sep="/")
    reader <- nullTextGatherer()
        ... 
    ## one JSON object per row, dropping the NA entries as we go
    jsondocs <- vector('character',nrow(chunk))
    for( row in 1:nrow(chunk) ){
      keepcols <- !is.na(chunk[row,])
      jsondocs[row] <- toJSON(chunk[row,keepcols])
    }
    bulkdocs <- paste('{"docs":[',paste(jsondocs, collapse=','),']}',sep='')
    ## fix JSON:  too many spaces, NA handled wrong
    gbdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
    ## this next is not needed now that I am stripping NA entries above, but better safe
    ## gbdocs <- gsub(" NA"," null"  ,x=gbdocs  ,perl=TRUE)
    curlPerform(
                url = uri
                ,httpheader = c('Content-Type'='application/json')
                ,customrequest = "POST"
                ,postfields = gbdocs
                ,writefunction = reader$update
                )

where the variable uri is the usual CouchDB endpoint for bulk uploads.
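For example, with a local CouchDB and a database named "detectors" (both placeholders), the paste() at the top of the loop produces:

    couchdb <- "http://localhost:5984"   # placeholder host
    db      <- "detectors"               # placeholder database name
    paste(couchdb, db, '_bulk_docs', sep = "/")
    ## [1] "http://localhost:5984/detectors/_bulk_docs"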

Update

Idiot!

Perhaps this is why I blog.  I don’t do pair programming or whatever, but here I am, sticking my code out on the internet with horrendously stupid errors!

Of course, I don’t need my toJSON loop!  All I need to do is create a structure and dump it that way.  I ignored my first rule of R programming:  loops are bad.

The way to avoid the loop is to let the much smarter person who programmed RJSONIO handle the iteration internally. Instead of the apply() stupidness, and instead of the loop, all I needed to do was create a list with a single element “docs” equal to the data.frame I wanted to store. In short, I recoded the above loop as

  ## simple structure for toJSON to unroll the loop internally
  jsondocs <- list('docs'=chunk)
  bulkdocs <- toJSON(jsondocs,collapse='')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)

I loaded some test data (the month of December for a single detector), reeled off 50,000 rows, and passed them to two routines, one the old way, one the new way.

With the old way, I got tired of waiting and killed the timer. R's system.time() command reported:

   Timing stopped at: 421.402 1.093 422.56 

The new way runs much faster. system.time() reports:

> system.time(docstring<-jsondump2(chunk))
   user  system elapsed 
 31.658   0.157  31.843 

That is at least a 10x speed up over the loopy way, probably more but I can’t be bothered finding out exactly how bad that loop was.

Update

Idiot!

Wrong again! The problem with the above code is that I didn’t inspect the generated JSON. What toJSON is doing is evaluating the data.frame as a list; that is, instead of row-wise processing, toJSON is processing each column in order. That is because a data.frame is really a list of column vectors, each of the same length. So although that approach is fast, it doesn’t work.
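On a toy data frame (made-up columns again), the difference looks roughly like this:

    df <- data.frame(speed = c(62.5, 58.1),
                     lane = c("left", "right"),
                     stringsAsFactors = FALSE)

    cat(toJSON(list(docs = df)))
    ## column-wise, roughly:
    ##   { "docs": { "speed": [ 62.5, 58.1 ], "lane": [ "left", "right" ] } }
    ##
    ## what CouchDB's _bulk_docs actually wants is one object per row:
    ##   { "docs": [ { "speed": 62.5, "lane": "left" },
    ##               { "speed": 58.1, "lane": "right" } ] }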

Which led me to the reason why toJSON seems to have a bug when applied using apply(…): apply coerces its argument into a matrix first, so a matrix with mixed character and numeric values gets converted into a matrix of character strings.
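It is easy to see the coercion directly (made-up columns once more); apply() runs its data.frame argument through as.matrix() before doing anything else:

    df <- data.frame(speed = c(62.5, 58.1),
                     lane = c("left", "right"),
                     stringsAsFactors = FALSE)

    storage.mode(as.matrix(df))   # "character" -- the numbers are now strings
    as.matrix(df)[1, "speed"]     # "62.5", which toJSON then quotes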

I took a stab at using the plyr library, but that was slower. I then took a look at the foreach library, but that was slower still.

My new champ is once again my old, super slow loop!

But just because I made a mistake doesn’t mean I shouldn’t persist. “Loops are bad” is an important rule when R code is slow, and my code is taking way too long to write out data, so that loop has got to go.

Realizing what was behind the “bug” with apply(*,1,toJSON) finally gave me the solution. What I had to do was split the data into numeric and text columns, separately apply toJSON, and then recombine the result.

  numeric.cols <- 1:35
  text.cols <- 36:37
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
  paste(num.data,text.data,collapse=',')

A few more problems presented themselves. First, each run of toJSON() produces an object, wrapped in curly braces. So the call to paste(), while it correctly combines the JSON strings row-wise, has buried inside of each “row” the invalid combination of “} {” as the numeric “object” ends and the text “object” begins. With a little regular expression glue, the correct paste line becomes:

 gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)

The final “fastest” function is:

numeric.cols <- 1:35
text.cols <- 36:37
jsondump4 <- function(chunk){
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
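  ## paste() glues each row's numeric and text objects with a space; the
  ## gsub below turns that "} {" seam into a comma so each row is one object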
  bulkdocs <- gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)
  bulkdocs <- paste('{"docs":[',bulkdocs,']}')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)
  bulkdocs
}

On 5,000 rows, the timings are:


> chunk <- docdf[1:5000,]
> system.time(bulkdocs<-jsondump1(chunk)) ## original loop
   user  system elapsed 
 40.798   0.001  40.823 
> system.time(bulkdocs<-jsondump2(chunk)) ## ddply from plyr library
   user  system elapsed 
 64.620   0.002  64.655 
> system.time(bulkdocs<-jsondump3(chunk)) ## foreach from foreach library
   user  system elapsed 
130.148   0.007 130.230 
> system.time(bulkdocs<-jsondump4(chunk)) ## call apply separately for numeric, text
   user  system elapsed 
  5.737   0.000   5.740 

and on 50,000 rows:


> chunk <- docdf[1:50000,]
> system.time(bulkdocs<-jsondump4(chunk))
   user  system elapsed 
 63.856   0.155  64.044 

And once again, my basic rule in struggling with R is proven true. Loops may not be bad, but if you have a loop, and you have slow code, it is worth figuring out how to get rid of that loop.
