Take that, cryptic error message

Sometimes when you have a program that works fine for weeks and weeks, it still has bugs that crop up for no apparent reason. Yesterday I ran into that sort of irritating situation, but I learned some stuff and so I’m writing this up so that there is one more possible solution paired to a cryptic error message for the search engines to suck up.

The situation

I am running a geospatial modeling job to estimate variables in time and space. There are a lot of little grids to process, and each needs a model run for each hour.

How I fire off multiple R jobs from node.js

Node.js has become my hammer of choice for most systems programming type jobs. In an earlier post I talked about how to use CouchDB to store the progress and state of jobs that need doing. Here I will demonstrate how I trigger those jobs and update CouchDB using a fairly simple node.js program.

Two key features of node that make this program possible are spawn and the ability to read and manipulate environment variables.

var spawn = require('child_process').spawn
var env = process.env

Node.js is fully capable of using child processes. One can choose from exec, execFile, spawn, and fork. For my usage, the spawn function does exactly what I want—it creates a child process that reports back when it exits.

The other useful tool is the ability to access the current running environment using the process.env variable. This allows my program to take note of any environment variables that are already set, and to fill in any missing variables that my child process might need.

Concurrency via queue

Using spawn one can fire off as many jobs as desired. Suppose you have a machine with four cores; then calling spawn four times will efficiently use your processing power. Unfortunately it isn’t usually that simple. Instead, what typically happens is that you have a lot of separable data crunching tasks that need to be run, and you want to have four data processing jobs running at all times until the work is all done. To accomplish this, the spawn function will need to be called four times (to fill up the processors) and then will need to spawn a new job whenever one of the existing jobs finishes.


When R and JSON fight

I have a love-hate relationship with R. R is extremely powerful and lots of fun when it works, but so often I spend hours at a time wondering what is going on (to put my irritation in printable prose).

Today I finally figured out a nagging problem. I am pulling data from CouchDB into R using the excellent RJSONIO and RCurl libraries. JSON requires that unknown values be written as null, while R has a more nuanced concept that includes both NA and NULL. My original use of the RJSONIO library to save data to CouchDB had to account for this fact by using a regular expression to convert NA to proper JSON null values. (I think the latest version of RJSONIO might actually handle this better, but I haven’t checked, as my current code works fine since the regex is conditional.)
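
The regex in question is nothing fancy; something along these lines does the conversion (a toy example on a made-up document):

## turn R's NA into valid JSON null before posting to CouchDB
json <- '{"nl1": [9, NA, 12]}'
cat(gsub(" NA", " null", json, perl = TRUE))
## {"nl1": [9, null, 12]}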

Now coming the other way, from CouchDB into R, RJSONIO’s fromJSON() function will happily convert JSON null values into R NULL values. My little couch.get() function looks like this:

couch.get <- function(db, docname, local=TRUE, h=getCurlHandle()){

  ## db may come in as a vector of name parts
  if(length(db)>1){
    db <- couch.makedbname(db)
  }
  ## couchdb and localcouchdb are globals holding the server URLs
  uri <- paste(couchdb,db,docname,sep="/")
  if(local) uri <- paste(localcouchdb,db,docname,sep="/")
  ## hack to url encode spaces
  uri <- gsub("\\s","%20",x=uri,perl=TRUE)
  ## fetch the document and parse the JSON into an R list
  fromJSON(getURL(uri,curl=h)[[1]])

}

The key line is the last one, where the results of RCurl’s getURL() function are passed directly to RJSONIO’s fromJSON() and then returned to the caller.

In my database, to save space, each document is a list of lists for a day.

{
   "_id": "1213686 2007-02-28 10:38:30",
   "_rev": "1-c8f0463d1910cf4e89370ece6ef250e2",
   "data": {
       "nl1": [9,12,12, ... ],
       "nr1": [ ... ],
       ...
       "ts" : [ ... ]
   }
}

Every entry in the ts list has a corresponding entry in every other array in the data object, but that entry could be null. This makes it easy to plot the data against time (using d3, but that is another post) or reload back into R with a timestamp.

But loading data into R isn’t quite the one-liner I was expecting, because of how R handles NULL compared to NA. My first and incorrect attempt was:

alldata <- doc$data
colnames <- names(alldata)
## deal with non ts first
varnames <-  grep( pattern="^ts$",x=colnames,perl=TRUE,invert=TRUE,value=TRUE )
## keep only what I am interested in
varnames <-  grep( pattern="^[no][lr]\\d+$",x=varnames,invert=TRUE,perl=TRUE,value=TRUE )
data.matrix <- matrix(unlist(alldata[varnames]),byrow=FALSE,ncol=length(varnames))

First I grab just the data object, pull off the variables of interest, then make a matrix out of the data.

The problem is the recursive application of unlist buried in the matrix command. The alldata object is really a list of lists, and some of those lists have NULL values, so the recursive application of unlist SILENTLY wipes out the NULL values. (So IRRITATING!)
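
A toy example (made-up lists) shows the failure mode:

## unlist() silently drops NULL entries, so the columns fall out of alignment
x <- list(a = list(1, NULL, 3), b = list(4, 5, 6))
length(unlist(x))                          ## 5, not 6 -- the NULL simply vanished
matrix(unlist(x), byrow = FALSE, ncol = 2) ## wrong: values shift and get recycled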

Instead what you have to do is carefully replace all numeric NULL values with what R wants: NA. (And this is where learning how to do all that callback programming in JavaScript comes in handy, as I define a callback function for the lapply call inline and don’t get worked up about it anymore.)

  ## first, make NULL into NA
  intermediate <- lapply(alldata[varnames],function(l){
    nullmask <- unlist(lapply(l, is.null))
    l[nullmask] <- NA
    l
  })
  ## then do the unlisting
  data.matrix <- matrix(unlist(intermediate),byrow=FALSE,ncol=length(varnames))

Most of the time the simple way worked fine, but it required special handling when I slapped the time-series column back onto my data. What I ended up having to do (when I was just hacking code that worked (TM)) was to drop timestamps for which all of the data I was interested in was NULL. And yes, the logic was as tortured as the syntax of that sentence.

But every once in a while the data would be out of sync, because sometimes there would be different numbers of NULL values in the variables I was extracting (for example, the mean would be fine, but one of the correlation coefficients would be undefined). In those cases the loop would either work and be wrong (if the odd number of NULL values happened to be perfectly aliased with the length of varnames), or else it would crash and get noted by my error handler.

With the new explicit loop to convert NULL to NA, the loading function works fine, with no more try-errors returned from my try call. And even better, I no longer have to lie awake nights wondering whether some data was just perfectly aliased with missing values so that it slipped through.

RJSONIO doesn’t handle NA, but gsub does (and other tales from the land of R)

 

RJSONIO is great. I’m so glad it exists and I don’t have to write it.

That said, I found and worked around two bugs in it today. First, my use case: I am saving one document per row of a data frame into CouchDB, so I need to convert each row with toJSON. But if you call it with

docs <- apply(data,1,toJSON)

it is going to break and write out everything as a string. That is, a number such as 0.123 will be written in the database as “0.123” with the quotes. Not so bad for basic numbers, but that gets irritating to handle with exponentials and so on. So instead I had to call it in a loop, once for each row.
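
You can see the problem with a toy data frame (made-up columns):

library(RJSONIO)
df <- data.frame(x = c(0.123, 4.56), id = c("a", "b"), stringsAsFactors = FALSE)
cat(apply(df, 1, toJSON)[1])
## roughly: { "x": "0.123", "id": "a" }  -- the number has become a quoted string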

Second, I found out the hard way once I fixed the above bug that NA is coded as…NA, not as null, even though NA is not valid JSON. CouchDB complained by puking on my entire bulk upload, and it took a while to find the problem.

The regex fix worked well, but I also realized that I can just drop the NA values altogether.

Also, because I am pushing up lots and lots of records, using the default basic reader chewed up a lot of RAM. Instead I hacked the original to make a “null” reader that saves nothing at all.
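
nullTextGatherer() is not part of RCurl; it is my hacked variant of RCurl’s basicTextGatherer(). A minimal sketch of the idea (not my exact code) looks something like this:

## a "null" text gatherer: same shape as RCurl's basicTextGatherer(),
## but the update() callback throws away whatever the server sends back,
## so bulk-upload responses never pile up in RAM
nullTextGatherer <- function(){
  update <- function(txt, ...){
    ## discard the response text, just report how many bytes we "handled"
    nchar(txt, type = "bytes")
  }
  list(update = update,
       value  = function(...) "")  ## nothing was kept, so nothing to return
}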

My final processing loop is as follows:

    ## the usual CouchDB bulk upload endpoint
    uri <- paste(couchdb,db,'_bulk_docs',sep="/")
    ## a reader whose callback throws away the server response (see above)
    reader <- nullTextGatherer()
        ... 
    ## convert each row to JSON on its own, dropping the NA columns as I go
    jsondocs <- list()
    for( row in 1:length(chunk[,1]) ){
      keepcols <- !is.na(chunk[row,])
      jsondocs[[row]] <- toJSON(chunk[row,keepcols])
    }
    bulkdocs <- paste('{"docs":[',paste(jsondocs, collapse=','),']}',sep='')
    ## fix JSON:  too many spaces, NA handled wrong
    gbdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
    ## this next is not needed now that I am stripping NA entries above, but better safe
    ## gbdocs <- gsub(" NA"," null"  ,x=gbdocs  ,perl=TRUE)
    curlPerform(
                url = uri
                ,httpheader = c('Content-Type'='application/json')
                ,customrequest = "POST"
                ,postfields = gbdocs
                ,writefunction = reader$update
                )

where the variable uri is the usual CouchDB endpoint for bulk uploads.

Update

Idiot!

Perhaps this is why I blog.  I don’t do pair programming or whatever, but here I am, sticking my code out on the internet with horrendously stupid errors!

Of course, I don’t need my toJSON loop!  All I need to do is create a structure and dump it that way.  I ignored my first rule of R programming:  loops are bad.

So, to avoid the loop, I just let the much smarter person who programmed RJSONIO do the loop unrolling. Instead of using apply stupidness, and instead of the loop, all I needed to do was create a list with a single element “docs” equal to the data.frame I wanted to store. In short, I recoded the above loop as

  ## simple structure for toJSON to unroll the loop internally
  jsondocs <- list('docs'=chunk)
  bulkdocs = toJSON(jsondocs,collapse='')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)

I loaded some test data (the month of December for a single detector), reeled off 50,000 rows, and passed them to two routines: one the old way, one the new way.

With the old way I got tired of waiting and killed the timer. R’s system.time() command reported:

   Timing stopped at: 421.402 1.093 422.56 

The new way runs much faster; system.time() reports:

> system.time(docstring<-jsondump2(chunk))
   user  system elapsed 
 31.658   0.157  31.843 

That is at least a 10x speed-up over the loopy way, probably more, but I can’t be bothered finding out exactly how bad that loop was.

Update

Idiot!

Wrong again! The problem with the above code is that I didn’t inspect the generated JSON. What toJSON is doing is evaluating the data.frame as a list; that is, instead of row-wise processing, toJSON is processing each column in order. That is because a data.frame is really a list of column vectors, each having the same length. So although that is fast, it doesn’t work.
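
A two-row, made-up data frame makes this obvious:

library(RJSONIO)
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
cat(toJSON(list(docs = df)))
## roughly: { "docs": { "x": [ 1, 2 ], "y": [ "a", "b" ] } }
## -- columns, not the row-wise documents CouchDB expects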

Which led me to the reason why toJSON seems to have a bug when applied using apply(…). It is because apply coerces its argument into a matrix first, so a matrix with mixed character and numeric values will get converted into a matrix of character strings.
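
Again with the same made-up data frame:

df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
unique(apply(df, 1, class))
## [1] "character"
## -- every row arrives in toJSON as a character vector, hence the quoted numbers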

I took a stab at using the plyr library, but that was slower. I then took a look at the foreach library, but that was slower still.

My new champ is once again my old, super slow loop!

But just because I made a mistake doesn’t mean I shouldn’t persist. “Loops are bad” is an important rule when R code is slow, and my code is taking way too long to write out data, and that loop has got to go.

Realizing what was behind the “bug” with apply(*,1,toJSON) finally gave me the solution. What I had to do was split the data into numeric and text columns, separately apply toJSON, and then recombine the result.

  numeric.cols <- 1:35
  text.cols <- 36:37
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
  paste(num.data,text.data,collapse=',')

A few more problems presented themselves. First, each run of toJSON() produces an object, wrapped in curly braces. So the call to paste(), while it correctly combines the JSON strings row-wise, has buried inside of each “row” the invalid combination of “} {” as the numeric “object” ends and the text “object” begins. With a little regular expression glue, the correct paste line becomes:

 gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)

The final “fastest” function is:

numeric.cols <- 1:35
text.cols <- 36:37
jsondump4 <- function(chunk){
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
  bulkdocs <- gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)
  bulkdocs <- paste('{"docs":[',bulkdocs,']}')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)
  bulkdocs
}

On 5,000 rows, the timings are:


> chunk <- docdf[1:5000,]
> system.time(bulkdocs<-jsondump1(chunk)) ## original loop
   user  system elapsed 
 40.798   0.001  40.823 
> system.time(bulkdocs<-jsondump2(chunk)) ## ddply from plyr library
   user  system elapsed 
 64.620   0.002  64.655 
> system.time(bulkdocs<-jsondump3(chunk)) ## foreach from foreach library
   user  system elapsed 
130.148   0.007 130.230 
> system.time(bulkdocs<-jsondump4(chunk)) ## call apply separately for numeric, text
   user  system elapsed 
  5.737   0.000   5.740 

and on 50,000 rows:


> chunk <- docdf[1:50000,]
> system.time(bulkdocs<-jsondump4(chunk))
   user  system elapsed 
 63.856   0.155  64.044 

And once again, my basic rule in struggling with R is proven true. Loops may not be bad, but if you have a loop, and you have slow code, it is worth figuring out how to get rid of that loop.

R ifelse tripped me up

In the continuing saga of climbing the R learning curve, I just found a bug in older code.

Although in hindsight I can’t see what on earth I was thinking.

I have been using ifelse as sort of a replacement for case statements. ifelse is cool because if you have a vector of things that take on a handful of values, or that you want to split on some value, you can write a single ifelse statement that works on the whole vector. So for example:

## assume ts is a POSIXlt object from January 1, 2008 through January 1, 2009
saturdays <- ifelse(ts$wday==6,saturdayvalue,otherdayvalue)

This isn’t a great example, because you can accomplish the same thing by indexing, but there are cases when ifelse is much more useful than indexing.
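
For the record, the indexing version of the same split would look something like this (same assumed variables as above, with saturdayvalue and otherdayvalue assumed to be scalars):

## indexing version: start with the default, then overwrite the Saturdays
values <- rep(otherdayvalue, length(ts$wday))
values[ts$wday == 6] <- saturdayvalue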

However, I made the mistake of thinking that ifelse knows what its target is (what context it is operating in), but in fact it only looks at its arguments, not its expected result. So I did something like:

> testCondition <- 2
> list.of.data <- 1:10

> list.of.data
[1]  1  2  3  4  5  6  7  8  9 10
> target.list <- ifelse(testCondition==1,list.of.data,list.of.data*2)

I expected target.list to be a list of 10 items, either doubled or not, depending on the value of testCondition. In fact, I got just one item:

> target.list
[1] 2

But if you make the condition a vector rather than a scalar, you get a vector result. So consider this:

> ## ifelse needs a list condition to generate a list output
> testCondition <- rep(1:3,5)
> testCondition
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> target.list <- ifelse(testCondition==1,list.of.data,rev(list.of.data))
> target.list
[1]  1  9  8  4  6  5  7  3  2 10 10  9  3  7  6

Which is an utterly crazy result. Even now, when I think I’m writing an example of what ifelse does, I was expecting a list of lists, with reversed lists where the condition was false. But no, ifelse generates as a result the same thing that it has for its condition…a simple vector.

Looking again at the code, what that ifelse command did is that every time the testCondition vector hit 1, it drew from the normally ordered list, and every time it hit 2 or 3, it drew from the reversed list.of.data variable, keeping the index in the vector going. So you start with a normal list at the first element, get 2 and 3 from the reversed list elements 2 and 3, then pull element 4 from the forward list at the 4th element, etc., then recycle both lists when you hit element 11 in the output vector.

> rev(list.of.data)
[1] 10  9  8  7  6  5  4  3  2  1
> list.of.data
[1]  1  2  3  4  5  6  7  8  9 10
> testCondition==1
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[13]  TRUE FALSE FALSE
> target.list
[1]  1  9  8  4  6  5  7  3  2 10 10  9  3  7  6
> 

So the result is shaped entirely by the test value, not by the expected result or by the two true-or-false answers. If you have a single scalar for your test, then you’ll get a scalar for your answer. If you have a vector of 15 elements for your test, you’ll get a vector of 15 elements for your answer, with the yes and no arguments being recycled as and if necessary. And you can generate truly weird results if you don’t think about what you’re doing.
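
The rule in one line, on made-up vectors:

## the result takes its length from the test; yes and no are recycled to fit
ifelse(c(TRUE, FALSE, TRUE, FALSE), 1:2, c(100, 200))
## [1]   1 200   1 200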

R. Struggle with it and it becomes clear.

Been using R almost exclusively for the past few weeks. I’ve always liked R, but I find the syntax and style maddeningly slow to ingest. Perhaps everybody is like this, but I’ve found that some programming language idioms I take to pretty readily (JavaScript and Perl), some I hate (Java before generics and Spring IoC was odious; after, it is at least tolerable), and others I just have to fight through a few weeks of doing things utterly wrong.

R falls in that last camp, but since I used to be pretty good at it back when I was working on my dissertation, I’ve always considered it to be my go-to stats language. So now that I have a major deliverable due, and it really needs more advanced statistics than the usual “mean/max/min/sd” one can usually throw at data, I’ve taken the plunge back into R syntax once again.

I’m building up scripts to process massive amounts of data (massive to me, perhaps not to Google and Yahoo, but a terabyte is still a terabyte), so each step of these scripts has to be fast. Periodically I come across some step that is just too slow, or something that used to be fast bogs down as I add more cruft and throw more data at it.

Here is an example of how R continues to confound me even after 3 weeks of R R R (I’m a pirate, watch me R).

RJSONIO to process CouchDB output

I have an idea. I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R via RCurl and RJSONIO (from http://www.omegahat.org/). So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, then use RJSONIO to parse the JSON, then use bootstrap methods to estimate the expected value and confidence bounds, and perhaps more importantly, try to estimate outliers and unusual events.