Take that, cryptic error message

Sometimes a program that works fine for weeks and weeks still has bugs that crop up for no apparent reason. Yesterday I ran into that sort of irritating situation, but I learned some stuff, so I’m writing this up so that there is one more possible solution paired with a cryptic error message for the search engines to suck up.

The situation

I am running a geospatial modeling job to estimate variables in time and space. There are a lot of little grids to process, and each needs a model run for each hour. The grids are processed one day per model run, and I’m storing the results in CouchDB. To check whether a grid/day combo is done, I can just look for the appropriate document in CouchDB: if it is there, the grid and day can be skipped, and if the get returns an error, I know the grid needs processing. When I was developing the code, it made perfect sense to do it this way. Generally I develop R with the interpreter open, so I can write code, run it, and see if it does what I expect. During testing I was looking at one grid cell at a time, so the following direct query of CouchDB for a single doc made sense.


couch.test.date <- paste(paste(year,checkmonth,checkday,sep='-'),"00%3A00",sep='%20')
couch.test.docid <- paste( (df.hpms.grids[cell,])$geo_id,couch.test.date,sep='_')
test.doc.json <- couch.get('grid%2Fstate4k%2Fhpms',couch.test.docid,local=TRUE)
if('error' %in% names(test.doc.json) ){
    hpmstodo[cell] <- TRUE  ## TRUE means this document still needs processing
    print(paste('need to do',couch.test.docid))
} else {
    hpmstodo[cell] <- FALSE ## FALSE means the doc is dropped from the index
}

The comedy

The problem came when I made that above code into a loop. I am of the uneducated and ignorant opinion that R syntax sucks ass. So sometimes I do really stupid things because I can’t do what I really want to do. I hate trying to figure out how to avoid loops in R, so I just stuck the above into a loop:

couch.test.date <- paste(paste(year,checkmonth,checkday,sep='-'),"00%3A00",sep='%20')

## loop over cells,  figure the todo list
picker <- 1:length(df.hpms.grids[,1])
hpmstodo <- picker > 0 ## default condition is TRUE
for(cell in picker){
    ## set todo to FALSE if already done in couchdb
    couch.test.docid <- paste( (df.hpms.grids[cell,])$geo_id,couch.test.date,sep='_')
    test.doc.json <- couch.get('grid%2Fstate4k%2Fhpms',couch.test.docid,local=TRUE)
    if('error' %in% names(test.doc.json) ){
        ## do this doc, no need to change the index
    } else {
        hpmstodo[cell] <- FALSE ##false means doc is dropped from index
    }
}

Cue the knowing snickers from both the R code mavens and the UNIX graybeards

Well, that is slow, but it works. Cue up a few hundred docs, hit CouchDB a few hundred times, and you have a TRUE/FALSE index that can be used to drop all of the grid cells that have already been finished for that day. In fact, this worked fine for a long time (weeks of watching my computers churn through urbanized areas in California). What I discovered yesterday was that the code only worked because of a happy coincidence: only smallish clusters of grid cells were being processed at any one time. Yesterday I started processing larger clusters of 1,000 to 2,000 cells. Given that I’m using plyr and doMC to process the data in parallel, that means I was firing off 6,000 to 12,000 queries to CouchDB in rapid succession.
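For context, the parallel driver looks roughly like the sketch below. registerDoMC() and d_ply() with .parallel=TRUE are the real doMC and plyr calls, but the data frame name and split variable here are stand-ins; only process.data.by.day comes from the tracebacks shown later.

library(plyr)
library(doMC)
registerDoMC(6)  ## six workers, one per core

## hypothetical driver: let plyr split the modeling data by day and
## run each day's processing in parallel across the registered workers
d_ply(df.model.data, .(day), process.data.by.day, .parallel=TRUE)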

Some guy in the audience starts to laugh really hard, one of those braying horse laughs that makes everybody else laugh at the guy. The UNIX guys, disgusted, get up and walk out.

The error messages

Well, my code started failing randomly, sometimes choking right away, other times getting a few days in and then dying, always with some mysterious complaint.

As it turns out, I deleted the original log files, so for the benefit of my future self (hi, future me, are you still rockin’ hard?) I actually reverted the code to reproduce them. I show two below; I thought I had another one, but I got tired of trying to recreate it.

The first error just shows the connection failing:

Error in .fun(d, ...) :
  task 13 failed - "error in evaluating the argument 'content' in selecting a method for function 'fromJSON': Error in function (type, msg, asError = TRUE)  :
  Recv failure: Connection reset by peer
Calls: getURL -> curlPerform -> .Call -> <Anonymous> -> fun
"
Calls: runme ... process.data.by.day -> d_ply -> l_ply -> %dopar% -> <Anonymous>
Execution halted

The second error has a slight wrinkle to it, in that it is a connect failure, not a receive failure.

Error in .fun(d, ...) :
  task 10 failed - "error in evaluating the argument 'content' in selecting a method for function 'fromJSON': Error in function (type, msg, asError = TRUE)  :
  Failed to connect to 127.0.0.1: Cannot assign requested address
Calls: getURL -> curlPerform -> .Call -> <Anonymous> -> fun
"
Calls: runme ... process.data.by.day -> d_ply -> l_ply -> %dopar% -> <Anonymous>
Execution halted

These errors are pretty thin on hints, but the second one is the one that eventually led me to the solution.

Searching down the wrong avenue

Given that I’ve recently had issues with CouchDB, I figured maybe it was Couch’s fault. I recompiled Erlang and CouchDB, and made sure permissions were sorted and so on. I noticed that there were lots of errors in the startup messages from CouchDB, but after 30 minutes of reading through mailing lists, it turned out those messages are completely benign and probably due to the fact that I’m using a non-release version of Erlang. Anyway, CouchDB was running fine, and the error kept happening.

Next I thought maybe CouchDB was getting clogged from handling too many replication requests at once. Recall that the code used to work fine, so I thought maybe the error was due to the recent change I had made in how I was replicating data between my machines. I eventually turned off all replications to the box that was failing, and ran my job. The error kept happening.

I discovered that I could make the error go away by reducing the parallelism in my R code. Specifically, doMC has an initializing function, registerDoMC(), which takes an optional argument that sets the number of cores to use. If I passed in 2, the error didn’t happen, but what the hell, 4 cores were sitting idle.
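That workaround, in code, was just this:

library(doMC)
registerDoMC(2)  ## only two workers, so only two streams of CouchDB
                 ## queries at once, and the sockets hold out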

A clue!

Finally I had some luck and ran across a thread in a PHP forum noting that, for a similar error message, the problem was that Linux was running out of sockets. Establishing lots of connections uses up sockets: each one burns an ephemeral port, and a closed connection lingers in the TIME_WAIT state for a while before the OS recycles its port. Open connections faster than the OS can recycle ports and you eventually run out, which is exactly what "Cannot assign requested address" is complaining about. Obviously I was checking a lot of docs, but I thought (following the docs for RCurl) that I was reusing a single curl handle and that was good enough. Clearly I thought wrong, as hard-coding every connection to CouchDB to use exactly the same curl handle didn’t fix the crashing.
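For the record, handle reuse with RCurl looks something like the sketch below; getCurlHandle() and the curl= argument are real RCurl API, though the URL pieces (couchdb.url, db, docids) are stand-ins. As I said, this wasn’t enough to save me.

library(RCurl)

h <- getCurlHandle()  ## one handle shared across requests, so libcurl
                      ## can try to keep the underlying connection alive
for(docid in docids){
    json <- getURL(paste(couchdb.url, db, docid, sep='/'), curl=h)
    ## ... check json for "error":"not_found" ...
}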

The fix

So I bit the bullet and recoded my stupid loop as a single query to _all_docs. CouchDB has an awesome feature: you can POST a list of document ids to a database’s _all_docs endpoint, and it will return just those documents. So instead of hitting the database n times to get n docs, you write out one big JSON string that lists all n document ids. Specifically, the syntax is to pass a JSON object that looks like this:

curl -d '{"keys":["100_223_2007-01-01 00:00","id2","id3"]}' \
 -X POST http://127.0.0.1:5984/grid%2Fstate4k%2Fhpms/_all_docs?include_docs=false

You get back an object that looks like this:

{"total_rows":209539954,"offset":0,"rows":[
    {"id":"100_223_2007-01-01 00:00","key":"100_223_2007-01-01 00:00","value":{"rev":"1-60e29f2c7a720c3f34657d9db705181f"}},
    {"key":"id2","error":"not_found"},
    {"key":"id3","error":"not_found"}
]}
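For what it’s worth, with RJSONIO’s fromJSON(...,simplify=FALSE) (the same call my code below uses), that response comes back as nested lists, and a missing doc is easy to spot by the names on its row. A small sketch, where response.json is a stand-in for the body above:

library(RJSONIO)
result <- fromJSON(response.json, simplify=FALSE)
row <- result$rows[[2]]  ## the id2 row
names(row)               ## "key" "error"
'error' %in% names(row)  ## TRUE, so this doc still needs processing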

The problem I was having with that approach is that, as I said above, R syntax blows chunks. I have a list of document ids, so my immediate mental solution was to stuff that list into a hash and call toJSON(). That fails in R: R’s lists are weird, and toJSON() never does what I want, producing wrong escapes, inconsistent quoting, and so on. Desperation is a strong motivator, however, so I calmed down, fired up R, and tested out some ways to construct the necessary JSON string directly. The solution is in two parts, below:


## okay, some bits of R syntax are cool.  Like this:
couch.test.docs <- paste(df.pred.keys$geo_id,couch.test.date,sep='_')
## that makes a vector of strings all at once, mashing the date onto each geoid

result <- couch.allDocsPost('grid%2Fstate4k%2Fhpms',couch.test.docs,include.docs=FALSE)
## process the result
rows <- result$rows
picker <- 1:length(df.pred.keys$geo_id)
hpmstodo <- picker < 0 ## default condition is FALSE
## this loop is likely unnecessary, but I've had enough R for now
for(cell in picker){
    row <- rows[[cell]] ## stupid list syntax
    if('error' %in% names(row)){
        hpmstodo[cell] <- TRUE  ## TRUE means this document still needs processing
    }
}
## okay, this TRUE/FALSE index thing is also cool
picked <- picker[hpmstodo]

At the end of the loop, I have an index, picked, that identifies the grid cells that still need to be processed for the day in question. Running the single bulk query takes just a little less time than pounding CouchDB with individual requests, but the best part is that I never run out of sockets with this code, as the hits on the database come far fewer and farther between.
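And as the comment in that loop admits, the loop is probably unnecessary; an untested vectorized version, using the same variables as above, would be:

## TRUE wherever CouchDB reported an error for that key, meaning the
## doc does not exist yet and that cell still needs processing
hpmstodo <- sapply(rows, function(row){ 'error' %in% names(row) })
picked <- picker[hpmstodo]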

The other half of the solution is the function couch.allDocsPost, which looks like this:


couch.allDocsPost <- function(db, keys, view='_all_docs', include.docs = TRUE, h=getCurlHandle()){
  ## cdb is a global holding the base CouchDB URI, e.g. http://127.0.0.1:5984
  uri <- paste(cdb,db,view,sep="/")
  ## build the {"keys":[...]} body by hand, since toJSON() won't cooperate
  k <- paste('{"keys":["',
             paste(keys,collapse='","'),
             '"]}',sep='')  ## this need for sep='' is sucktastic
  if(include.docs){
      uri <- paste(uri,'include_docs=true',sep='?')
  }
  reader <- basicTextGatherer()
  curlPerform(
              url = uri
              ,customrequest = "POST"
              ,httpheader = c('Content-Type'='application/json')
              ,postfields = k
              ,writefunction = reader$update
              ,curl=h
              )
  fromJSON(reader$value()[[1]],simplify=FALSE)
}

The trick here is the funky double paste command that builds the JSON string, and the stupid sep='' parameter needed to tell paste not to put spaces between its arguments. Every other language I use defaults to no separator and calls this operation join, but again, R is special in some ways.
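To make the quirk concrete:

paste('a','b')         ## "a b" -- paste inserts a space by default
paste('a','b',sep='')  ## "ab"  -- what a join does everywhere else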

Conclusion

The moral of the story is that even if you suck at programming R, and suck at understanding Linux systems programming, you can still take a deep breath and solve your cryptic error message with the help of the Internet and patience.
