RJSONIO is great. I’m so glad it exists and I don’t have to write it.
That said, I found and worked around two bugs in it today. First, my use case: I am saving one document per row of a data frame into CouchDB, so I need to convert each row with toJSON. But if you call it with
docs <- apply(data,1,toJSON)
it is going to break and write out everything as a string. That is, a number such as 0.123 will be written in the database as “0.123” with the quotes. Not so bad for basic numbers, but that gets irritating to handle with exponentials and so on. So instead I had to call it in a loop, once for each row.
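Here is a quick illustration of the quoting problem, using a made-up two-column data frame in place of my real data:

library(RJSONIO)
## hypothetical data frame: one numeric column, one text column
data <- data.frame(speed = c(0.123, 0.456),
                   name  = c("det1", "det2"),
                   stringsAsFactors = FALSE)
apply(data, 1, toJSON)
## each element comes back looking something like
##   { "speed": "0.123", "name": "det1" }
## i.e. the number is now a quoted string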
Second, I found out the hard way once I fixed the above bug that NA is coded as…NA, not as null, even though NA is not valid JSON. CouchDB complained by puking on my entire bulk upload, and it took a while to find the problem.
A regex fix worked well, but I also realized that I could just drop the NA values altogether.
Also, because I am pushing up lots and lots of records, using the default basicTextGatherer() reader chewed up a lot of RAM, since it saves every response. Instead I hacked the original to make a “null” reader that saves nothing at all.
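A minimal sketch of the idea, modeled on RCurl's basicTextGatherer(), looks something like this (returning the byte count from the write callback is my assumption about what libcurl wants to see):

## a "null" text gatherer: same shape as basicTextGatherer(), but update()
## throws away whatever libcurl hands it instead of accumulating it in RAM
nullTextGatherer <- function(){
  update <- function(txt){
    ## report the chunk as handled, but keep nothing
    nchar(txt, type = "bytes")
  }
  list(update = update,
       value  = function() character(0),
       reset  = function() invisible(NULL))
}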
My final processing loop is as follows:
uri <- paste(couchdb, db, '_bulk_docs', sep="/")
reader <- nullTextGatherer()
...
jsondocs <- list()
for( row in 1:nrow(chunk) ){
  ## drop the NA entries from this row, then convert it
  keepcols <- !is.na(chunk[row,])
  jsondocs[row] <- toJSON(chunk[row,keepcols])
}
bulkdocs <- paste('{"docs":[', paste(jsondocs, collapse=','), ']}', sep='')
## fix JSON: too many spaces, NA handled wrong
gbdocs <- gsub("\\s\\s*", " ", x=bulkdocs, perl=TRUE)
## this next is not needed now that I am stripping NA entries above, but better safe
## gbdocs <- gsub(" NA", " null", x=gbdocs, perl=TRUE)
curlPerform( url = uri
           , httpheader = c('Content-Type'='application/json')
           , customrequest = "POST"
           , postfields = gbdocs
           , writefunction = reader$update )
where the variable uri is the usual CouchDB endpoint for bulk uploads.
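For concreteness, with made-up server and database names the pieces look like:

## hypothetical CouchDB server and database names
couchdb <- "http://localhost:5984"
db      <- "mydb"
uri     <- paste(couchdb, db, '_bulk_docs', sep="/")
## uri is now "http://localhost:5984/mydb/_bulk_docs"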
Update
Idiot!
Perhaps this is why I blog. I don’t do pair programming or whatever, but here I am, sticking my code out on the internet with horrendously stupid errors!
Of course, I don’t need my toJSON loop! All I need to do is create a structure and dump it that way. I ignored my first rule of R programming: loops are bad.
So, to avoid the loop, I just let the much smarter person who programmed RJSONIO do the loop unrolling. Instead of the apply() silliness, and instead of the loop, all I needed to do was create a list with a single element “docs” equal to the data.frame I wanted to store. In short, I recoded the above loop as
## simple structure for toJSON to unroll the loop internally
jsondocs <- list('docs'=chunk)
bulkdocs <- toJSON(jsondocs, collapse='')
## fix JSON: too many spaces, NA handled wrong
bulkdocs <- gsub("\\s\\s*", " ", x=bulkdocs, perl=TRUE)
## this next is needed again
bulkdocs <- gsub(" NA", " null", x=bulkdocs, perl=TRUE)
I loaded some test data (the month of December for a single detector), reeled off 50,000 rows, and passed them to two routines: one the old way, one the new way.
With the old way, I got tired of waiting and killed the timer. R's system.time() command reported:
Timing stopped at: 421.402 1.093 422.56
The new way runs much faster. system.time() reports
> system.time(docstring<-jsondump2(chunk))
   user  system elapsed
 31.658   0.157  31.843
That is at least a 10x speed-up over the loopy way, probably more, but I can't be bothered finding out exactly how bad that loop was.
Update
Idiot!
Wrong again! The problem with the above code is that I didn't inspect the generated JSON. What toJSON is doing is evaluating the data.frame as a list; that is, instead of row-wise processing, toJSON is processing each column in order. That is because a data.frame is really a list of column vectors, each of the same length. So although that is fast, it doesn't work.
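The column-wise behavior is easy to see on a toy data frame:

library(RJSONIO)
## hypothetical two-row data frame
df <- data.frame(speed = c(0.123, 0.456),
                 name  = c("det1", "det2"),
                 stringsAsFactors = FALSE)
cat(toJSON(list(docs = df)))
## prints something like
##   { "docs": { "speed": [ 0.123, 0.456 ], "name": [ "det1", "det2" ] } }
## one array per column, not one JSON object per row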
Which led me to the reason why toJSON seems to have a bug when applied using apply(): apply() coerces its argument into a matrix first, so a data frame with mixed character and numeric columns gets converted into a matrix of character strings.
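The coercion is easy to confirm:

## hypothetical mixed-type data frame
df <- data.frame(speed = c(0.123, 0.456),
                 name  = c("det1", "det2"),
                 stringsAsFactors = FALSE)
mode(as.matrix(df))   ## "character" -- apply() does this conversion internally
apply(df, 1, class)   ## every row shows up as "character", which is what toJSON sees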
I took a stab at using the plyr library, but that was slower. I then took a look at the foreach library, but that was slower still.
My new champ is once again my old, super slow loop!
But just because I made a mistake doesn't mean I shouldn't persist. “Loops are bad” is an important rule when R code is slow; my code is taking way too long to write out data, and that loop has got to go.
Realizing what was behind the “bug” with apply(*,1,toJSON) finally gave me the solution. What I had to do was split the data into numeric and text columns, separately apply toJSON, and then recombine the result.
numeric.cols <- 1:35
text.cols <- 36:37
num.data <- apply(chunk[,numeric.cols], 1, toJSON)
text.data <- apply(chunk[,text.cols], 1, toJSON)
paste(num.data, text.data, collapse=',')
A few more problems presented themselves. First, each run of toJSON() produces an object wrapped in curly braces. So the call to paste(), while it correctly combines the JSON strings row-wise, buries inside each “row” the invalid combination “} {” where the numeric “object” ends and the text “object” begins. With a little regular expression glue, the correct paste line becomes:
gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)
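On a pair of made-up single-row pieces, the repair works like this:

## hypothetical stand-ins for one element of num.data and text.data
num.data  <- '{ "speed": 0.123 }'
text.data <- '{ "name": "det1" }'
gsub('} {', ',', x=paste(num.data, text.data, collapse=','), perl=TRUE)
## [1] "{ \"speed\": 0.123 , \"name\": \"det1\" }"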
The final “fastest” function is:
numeric.cols <- 1:35
text.cols <- 36:37
jsondump4 <- function(chunk){
  num.data <- apply(chunk[,numeric.cols], 1, toJSON)
  text.data <- apply(chunk[,text.cols], 1, toJSON)
  bulkdocs <- gsub('} {', ',', x=paste(num.data, text.data, collapse=','), perl=TRUE)
  bulkdocs <- paste('{"docs":[', bulkdocs, ']}')
  ## fix JSON: too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*", " ", x=bulkdocs, perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA", " null", x=bulkdocs, perl=TRUE)
  bulkdocs
}
On 5,000 rows, the timings are:
chunk <- docdf[1:5000,]
system.time(bulkdocs<-jsondump1(chunk))  ## original loop
   user  system elapsed
 40.798   0.001  40.823
system.time(bulkdocs<-jsondump2(chunk))  ## ddply from plyr library
   user  system elapsed
 64.620   0.002  64.655
system.time(bulkdocs<-jsondump3(chunk))  ## foreach from foreach library
   user  system elapsed
130.148   0.007 130.230
system.time(bulkdocs<-jsondump4(chunk))  ## call apply separately for numeric, text
   user  system elapsed
  5.737   0.000   5.740
and on 50,000 rows:
> chunk <- docdf[1:50000,]
> system.time(bulkdocs<-jsondump4(chunk))
   user  system elapsed
 63.856   0.155  64.044
And once again, my basic rule for struggling with R is proven true: loops may not be bad, but if you have a loop and you have slow code, it is worth figuring out how to get rid of that loop.