Using npm with R is great

A few weeks ago I wrote up how I am using npm, the standard package manager for node.js. When I first started using node.js, and when npm first started cropping up as the best package manager to use, I was annoyed by the idea of copying libraries over and over again into each package’s node_modules directory. It seemed wasteful, since they were mostly copies, so I would generally use the -g flag and install globally.

Then I ran into trouble with versions, and I decided my insistence on using -g was stupid. Better to have the required version locally installed than to fight with multiple versions of a package at the global level.

The point is that today, in R, I need to depend on readr, but the GitHub version, not the CRAN version, because I need to parse a column of times that use “AM/PM” notation. As far as I am aware, there isn’t a clean way in R to load conflicting versions of a package. I don’t want my programs to use the bleeding edge of readr, but I am willing to accept the devel version for this package.

Unfortunately, I’m the only person using npm to install R packages locally within a project. Phooey. But I can hack my R launching script to use devtools to install the package I need locally, as follows.

First, I have a standard incantation to make my runtime R find my local, node_modules-installed R libraries:

## need node_modules directories
dot_is <- getwd()
node_paths <- dir(dot_is, pattern = '\\.Rlibs',
                  full.names = TRUE, recursive = TRUE,
                  ignore.case = TRUE, include.dirs = TRUE,
                  all.files = TRUE)
path <- normalizePath(node_paths, winslash = "/", mustWork = FALSE)
## prepend the local .Rlibs paths so they take precedence over global libraries
lib_paths <- .libPaths()
.libPaths(c(path, lib_paths))

This bit of code will dive down into the local node_modules directory, recursively find all of the .Rlibs directories, and prepend them to the runtime .libPaths, so that local libraries take precedence over global ones.

All I have to do is insert a command to install the required devel-level packages before installing and testing my code. Something like:

## need node_modules directories
dot_is <- getwd()
node_paths <- dir(dot_is, pattern = '\\.Rlibs',
                  full.names = TRUE, recursive = TRUE,
                  ignore.case = TRUE, include.dirs = TRUE,
                  all.files = TRUE)
## make sure there is a local .Rlibs directory to install into
path <- normalizePath('node_modules/.Rlibs', winslash = "/", mustWork = FALSE)
if(!file.exists(path)){
    dir.create(path)
}
lib_paths <- .libPaths()
.libPaths(c(path, node_paths, lib_paths))
## grab the devel version of readr if an acceptable version isn't already installed
vc <- list(op = ">=", version = package_version("0.1.1.9000"))
if(!requireNamespace(package = 'readr', versionCheck = vc)){
    devtools::install_github('hadley/readr')
}

I can save that as Requirements.R, and then add the following to my package.json file:

...
  "scripts": {
      "test": "/usr/bin/Rscript Rtest.R",
      "preinstall": "/usr/bin/Rscript Requirements.R",
      "install":"/usr/bin/Rscript Rinstall.R"
  },
...

That works and is cool, but it is extremely one-off. Better would be to declare dependencies in package.json and have them installed automatically. My unfinished start at this is to create an entry “rDependencies” in package.json, which npm will then expose to my scripts in the environment as variables named “npm_package_rDependencies_…”. But I have to move on, so this remains unfinished for now:

package.json

...
  "dependencies": {
      "calvad_rscripts": "jmarca/calvad_rscripts",
      "rcouchutils":"git://github.com/jmarca/rstats_couch_utils.git",
      "configr":"git://github.com/jmarca/configr.git"
  },
  "devDependencies": {
    "should": "^6.0.1"
  },
  "rDependencies":{
      "readr":"0.1.1.9000"
  },
  "scripts": {
      "test": "/usr/bin/Rscript Rtest.R",
      "preinstall": "/usr/bin/Rscript Requirements.R",
      "install":"/usr/bin/Rscript Rinstall.R"
  },
...

script snippet to read package.json dependencies

## ideally I would plumb versions from package.json environment variables?

envrr <- Sys.getenv()
print(envrr)
dependencies <- grep(pattern='npm_package_rDependencies'
                    ,x=names(envrr),perl=TRUE,value=TRUE)
print(dependencies)
pkgs <- strsplit(x=dependencies,split='npm_package_rDependencies_')
print(pkgs)
for(i in seq_along(dependencies)){
    pkg <- pkgs[[i]][2]
    ver <- envrr[[dependencies[i]]]
    vc <-  list(op=">=",version=package_version(ver))
    print(vc)
    if(!requireNamespace(package=pkg,versionCheck=vc)){
        print('need to download')
        devtools::install_github(paste('hadley',pkg,sep='/'))
        ## whoops, need to add proper github user, repo name here
    }else{
        print(paste('got',pkg,ver,'already'))
    }
}

Really I need to specify the required development R package like:

  "rDependencies":{
      "readr":{
          "repo":"hadley/readr",
          "version":"0.1.1.9000"
      }
  },

But the hacking gets uglier and uglier, because this gets passed to the script as npm_package_rDependencies_readr_repo and npm_package_rDependencies_readr_version, which means my braindead regexpr and split calls will need to be tweaked and patched some more to match up each package with its repo and version.
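
Something like the following untested sketch is where I would start. The variable names are mine, not working code; it leans on the same requireNamespace/devtools dance as Requirements.R above, and on the flattened environment variable names that npm generates from the nested JSON.

## untested sketch: walk the flattened npm_package_rDependencies_* variables,
## pair up each package's repo and version, and install from github if needed
envrr <- Sys.getenv()
dep_vars <- grep('^npm_package_rDependencies_', names(envrr), perl=TRUE, value=TRUE)
stubs <- sub('^npm_package_rDependencies_', '', dep_vars, perl=TRUE)
pkgs <- unique(sub('_(repo|version)$', '', stubs, perl=TRUE))
for(pkg in pkgs){
    repo <- envrr[[paste('npm_package_rDependencies', pkg, 'repo', sep='_')]]
    ver  <- envrr[[paste('npm_package_rDependencies', pkg, 'version', sep='_')]]
    vc <- list(op=">=", version=package_version(ver))
    if(!requireNamespace(package=pkg, versionCheck=vc)){
        devtools::install_github(repo)
    }
}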

So, future me, you have work to do and another blog post when you get this cleaned up.

Modernizing my approach to my R packages

I’ve been using R since 2000 or so, probably earlier, off and on. I’ve always just hacked out big old spaghetti-code programs. More recently, as alluded to in this past post, I’ve been migrating to using node.js to call into R. The initial impetus was to solve a problem with memory leakage, and with a single crash disrupting a really big sequence of jobs. By setting up my big outer loops in node.js, I can now fire off as many simultaneous R jobs as my RAM can handle, and if any die, node.js can handle the errors appropriately.

The one remaining issue was that my R code was still pretty brutish. I dislike the formal R packaging stuff, and I wanted something more lightweight, more like what node.js uses. I first tried to use component, but that was the wrong choice for a number of reasons. Way back in October I toyed with the idea of using npm to package up my R code, but I didn’t really start doing that in earnest until very recently. It turns out that, with just a few quirks, this works pretty well. This post outlines my general approach to using npm to load R packages.

Continue reading

Take that, cryptic error message

Sometimes when you have a program that works fine for weeks and weeks, it still has bugs that crop up for no apparent reason. Yesterday I ran into that sort of irritating situation, but I learned some stuff, so I’m writing it up so that there is one more possible solution paired with a cryptic error message for the search engines to suck up.

The situation

I am running a geospatial modeling job to estimate variables in time and space. There are a lot of little grids to process, and each needs a model run for each hour. Continue reading

How I fire off multiple R jobs from node.js

Node.js has become my hammer of choice for most systems programming type jobs. In an earlier post I talked about how to use CouchDB to store the progress and state of jobs that need doing. Here I will demonstrate how I trigger those jobs and update CouchDB using a fairly simple node.js program.

Two key features of node that make this program possible are spawn and the ability to read and manipulate environment variables.

var spawn = require('child_process').spawn
var env = process.env

Node.js is fully capable of using child processes. One can choose from exec, execFile, spawn, and fork. For my usage, the spawn function does exactly what I want—it creates a child process that reports back when it exits.

The other useful tool is the ability to access the current running environment using the process.env variable. This allows my program to take note of any environment variables that are already set, and to fill in any missing variables that my child process might need.

Concurrency via queue

Using spawn, one can fire off as many jobs as desired. Suppose you have a machine with four cores; then calling spawn four times will use your processing power efficiently. Unfortunately it isn’t usually that simple. Instead, what typically happens is that you have a lot of separable data-crunching tasks that need to be run, and you want four data-processing jobs running at all times until the work is all done. To accomplish this, spawn needs to be called four times (to fill up the processors), and then a new job needs to be spawned whenever one of the existing jobs finishes.

Continue reading

When R and JSON fight

I have a love-hate relationship with R. R is extremely powerful and lots of fun when it works, but so often I spend hours at a time wondering what is going on (to put my irritation in printable prose).

Today I finally figured out a nagging problem. I am pulling data from CouchDB into R using the excellent RJSONIO and RCurl libraries. JSON has a strict requirement that unknown values are called null, while R has a more nuanced concept that includes NA as well as NULL. My original use of the RJSONIO library to save data to CouchDB had to account for this fact by using a regular expression to convert NA to proper JSON null values. (I think the latest version of RJSONIO might actually handle this better, but I haven’t checked, as my current code works fine since the regex is conditional.)
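
For reference, that regex fix amounts to the same one-liner that shows up again in the bulk-upload code later on this page, with json_body standing in for whatever string is about to be sent to CouchDB:

## turn R's NA (not valid JSON) into a proper JSON null before upload
json_body <- gsub(" NA", " null", x=json_body, perl=TRUE)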

Now coming the other way, from CouchDB into R, RJSONIO’s fromJSON() function will happily convert JSON null values into R NULL values. My little couch.get() function looks like this:

couch.get <- function(db,docname, local=TRUE, h=getCurlHandle()){

  if(length(db)>1){
    db <- couch.makedbname(db)
  }
  uri <- paste(couchdb,db,docname,sep="/");
  if(local) uri <- paste(localcouchdb,db,docname,sep="/");
  ## hack to url encode spaces
  uri <- gsub("\\s","%20",x=uri,perl=TRUE)
  fromJSON(getURL(uri,curl=h)[[1]])

}

The key line is the last one, where the results of RCurl’s getURL() function are passed directly to RJSONIO’s fromJSON() and then returned to the caller.

In my database, to save space, each document is a list of lists for a day.

{
   "_id": "1213686 2007-02-28 10:38:30",
   "_rev": "1-c8f0463d1910cf4e89370ece6ef250e2",
   "data": {
       "nl1": [9,12,12, ... ],
       "nr1": [ ... ],
       ...
       "ts" : [ ... ]
   }
}

Every entry in the ts list has a corresponding entry in every other array in the data object, but that entry could be null. This makes it easy to plot the data against time (using d3, but that is another post) or reload back into R with a timestamp.

But loading data into R isn’t quite the one-liner I was expecting, because of how R handles NULL compared to NA. My first and incorrect attempt was:

alldata <- doc$data
colnames <- names(alldata)
## deal with non ts first
varnames <-  grep( pattern="^ts$",x=colnames,perl=TRUE,invert=TRUE,val=TRUE )
## keep only what I am interested in
varnames <-  grep( pattern="^[no][lr]\\d+$",x=varnames,invert=TRUE,perl=TRUE,val=TRUE )
data.matrix <- matrix(unlist(alldata[varnames]),byrow=FALSE,ncol=length(varnames))

First I grab just the data object, pull off the variables of interest, then make a matrix out of the data.

The problem is the recursive application of unlist buried inside that matrix() call. The alldata object is really a list of lists, and some of those lists have NULL values, so the recursive unlist SILENTLY drops the NULL values (So IRRITATING!).
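
A tiny made-up example (field names echoing the document above) shows the silent data loss:

x <- list(nl1=list(9, NULL, 12), nr1=list(1, 2, NULL))
unlist(x)                                 ## only four values survive; the NULLs vanish silently
matrix(unlist(x), byrow=FALSE, ncol=2)    ## a 2x2 matrix where I expected 3 rows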

Instead, what you have to do is carefully replace all numeric NULL values with what R wants: NA. (And this is where learning how to do all that callback programming in JavaScript comes in handy, as I define a callback function for lapply inline and don’t get worked up about it anymore.)

  ## first, make NULL into NA
  intermediate <- lapply(alldata[varnames],function(l){
    nullmask <- unlist(lapply(l, is.null))
    l[nullmask] <- NA
    l
  })
  ## then do the unlisting
  data.matrix <- matrix(unlist(intermediate),byrow=FALSE,ncol=length(varnames))

Most of the time the simple way worked fine, but it required special handling when I slapped the timeseries column back onto my data. What I ended up having to do (back when I was just hacking code that worked (TM)) was to drop any timestamp for which all of the variables I was interested in were NULL. And yes, the logic was as tortured as the syntax of that sentence.

But every once in a while the data would be out of sync, because sometimes there would be different numbers of NULL values in the variables I was extracting (for example, the mean would be fine, but one of the correlation coefficients would be undefined). In those cases the loop would either work and be wrong (if the odd number of NULL values happened to be perfectly aliased with the length of varnames), or else it would crash and get noted by my error handler.

With the new explicit loop to convert NULL to NA, the loading function works fine, with no more try-errors returned from my try call. And even better, I no longer have to lie awake nights wondering whether some data was just perfectly aliased with missing values so that it slipped through.

RJSONIO doesn’t handle NA, but gsub does (and other tales from the land of R)

 

RJSONIO is great. I’m so glad it exists and I don’t have to write it.

That said, I found and worked around two bugs in it today. First, my use case: I am saving one document per row of a data frame into CouchDB, so I need to convert each row with toJSON. But if you call it with

docs <- apply(data,1,toJSON)

it is going to break and write out everything as a string. That is, a number such as 0.123 will be written in the database as “0.123” with the quotes. Not so bad for basic numbers, but that gets irritating to handle with exponentials and so on. So instead I had to call it in a loop, once for each row.

Second, I found out the hard way once I fixed the above bug that NA is coded as…NA, not as null, even though NA is not valid JSON. CouchDB complained by puking on my entire bulk upload, and it took a while to find the problem.

A regex fix worked well, but I also realized that I can just drop the NA values altogether.

Also, because I am pushing up lots and lots of records, using the default basic reader chewed up a lot of RAM. Instead I hacked the original to make a “null” reader that saves nothing at all.
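
My actual hack isn’t reproduced here, but a minimal sketch of such a reader might look like the following: a list with the same update/value/reset shape as RCurl’s basicTextGatherer(), where update just throws the chunk away.

## sketch only, not the real thing: a gatherer-shaped object that keeps nothing
nullTextGatherer <- function(){
    list(
        update = function(txt, ...){
            ## throw the chunk away; return its byte count, which is what a
            ## curl write callback normally reports
            nchar(txt, type="bytes")
        },
        value = function(...) "",
        reset = function() invisible(NULL)
    )
}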

My final processing loop is as follows:

    uri <- paste(couchdb, db, '_bulk_docs', sep="/")
    reader <- nullTextGatherer()
        ...
    ## build up one JSON string per row, dropping the NA entries
    jsondocs <- list()
    for(row in seq_len(nrow(chunk))){
      keepcols <- !is.na(chunk[row,])
      jsondocs[row] <- toJSON(chunk[row, keepcols])
    }
    bulkdocs <- paste('{"docs":[', paste(jsondocs, collapse=','), ']}', sep='')
    ## fix JSON:  too many spaces, NA handled wrong
    gbdocs <- gsub("\\s\\s*", " ", x=bulkdocs, perl=TRUE)
    ## this next is not needed now that I am stripping NA entries above, but better safe
    ## gbdocs <- gsub(" NA"," null"  ,x=gbdocs  ,perl=TRUE)
    curlPerform(
                url = uri
                ,httpheader = c('Content-Type'='application/json')
                ,customrequest = "POST"
                ,postfields = gbdocs
                ,writefunction = reader$update
                )

where the variable uri is the usual CouchDB endpoint for bulk uploads.

Update

Idiot!

Perhaps this is why I blog.  I don’t do pair programming or whatever, but here I am, sticking my code out on the internet with horrendously stupid errors!

Of course, I don’t need my toJSON loop!  All I need to do is create a structure and dump it that way.  I ignored my first rule of R programming:  loops are bad.

So, to avoid the loop, I just let the much smarter person who programmed RJSONIO do the loop unrolling for me. Instead of the apply stupidness, and instead of the loop, all I needed to do was create a list with a single element “docs” equal to the data.frame I wanted to store. In short, I recoded the above loop as

  ## simple structure for toJSON to unroll the loop internally
  jsondocs <- list('docs'=chunk)
  bulkdocs = toJSON(jsondocs,collapse='')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)

I loaded some test data (the month of December for a single detector), reeled off 50,000 rows, and passed them to two routines, one the old way, one the new way.

With the old way, I got tired of waiting and killed the timer. R’s system.time command reported:

   Timing stopped at: 421.402 1.093 422.56 

The new way runs much faster; system.time() reports

> system.time(docstring<-jsondump2(chunk))
   user  system elapsed 
 31.658   0.157  31.843 

That is at least a 10x speed up over the loopy way, probably more but I can’t be bothered finding out exactly how bad that loop was.

Update

Idiot!

Wrong again! The problem with the above code is that I didn’t inspect the generated JSON. What toJSON is doing is evaluating the data.frame as a list; that is, instead of row-wise processing, toJSON is processing each column in order. That is because a data.frame is really a list of columns, each of the same length. So although that approach is fast, it doesn’t work.

Which led me to the reason why toJSON seems to have a bug when applied using apply(...): apply coerces its argument into a matrix first, so a data.frame with mixed character and numeric columns gets converted into a matrix of character values.
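
The coercion is easy to see directly with a throwaway two-column data.frame:

mixed <- data.frame(x=c(0.123, 4.56), label=c("a", "b"), stringsAsFactors=FALSE)
typeof(as.matrix(mixed))    ## "character": apply() does this coercion behind the scenes
as.matrix(mixed)[1, "x"]    ## "0.123", now a quoted string rather than a number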

I took a stab at using the plyr library, but that was slower. I then took a look at the foreach library, but that was slower still.

My new champ is once again my old, super slow loop!

But just because I made a mistake doesn’t mean I shouldn’t persist. “Loops are bad” is an important rule when R code is slow, my code is taking way too long to write out data, and that loop has got to go.

Realizing what was behind the “bug” with apply(*,1,toJSON) finally gave me the solution. What I had to do was split the data into numeric and text columns, separately apply toJSON, and then recombine the result.

  ## column positions specific to my data: 35 numeric columns, then 2 text columns
  numeric.cols <- 1:35
  text.cols <- 36:37
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
  paste(num.data,text.data,collapse=',')

A few more problems presented themselves. First, each run of toJSON() produces an object, wrapped in curly braces. So the call to paste(), while it correctly combines the JSON strings row-wise, has buried inside of each “row” the invalid combination of “} {” as the numeric “object” ends and the text “object” begins. With a little regular expression glue, the correct paste line becomes:

 gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)

The final “fastest” function is:

numeric.cols <- 1:35
text.cols <- 36:37
jsondump4 <- function(chunk){
  num.data <- apply(chunk[,numeric.cols],1,toJSON)
  text.data <- apply(chunk[,text.cols],1,toJSON)
  bulkdocs <- gsub('} {',',',x=paste(num.data,text.data,collapse=','),perl=TRUE)
  bulkdocs <- paste('{"docs":[',bulkdocs,']}')
  ## fix JSON:  too many spaces, NA handled wrong
  bulkdocs <- gsub("\\s\\s*"," ",x=bulkdocs,perl=TRUE)
  ## this next is needed again
  bulkdocs <- gsub(" NA"," null"  ,x=bulkdocs  ,perl=TRUE)
  bulkdocs
}

On 5,000 rows, the timings are:


chunk <- docdf[1:5000,]
system.time(bulkdocs<-jsondump1(chunk)) ## original loop
   user  system elapsed 
 40.798   0.001  40.823 
 system.time(bulkdocs<-jsondump2(chunk)) ## ddply from plyr library
   user  system elapsed 
 64.620   0.002  64.655 
 system.time(bulkdocs<-jsondump3(chunk)) ## foreach from foreach library
   user  system elapsed 
130.148   0.007 130.230 
 system.time(bulkdocs<-jsondump4(chunk)) ## call apply separately for numeric, text
   user  system elapsed 
  5.737   0.000   5.740 

and on 50,000 rows:


> chunk <- docdf[1:50000,]
> system.time(bulkdocs<-jsondump4(chunk))
   user  system elapsed 
 63.856   0.155  64.044 

And once again, my basic rule in struggling with R is proven true. Loops may not be bad, but if you have a loop, and you have slow code, it is worth figuring out how to get rid of that loop.

R ifelse tripped me up

In the continuing saga of climbing the R learning curve, I just found a bug in older code.

Although in hindsight I can’t see what on earth I was thinking.

I have been using ifelse as sort of a replacement for case statements. ifelse is cool because if you have a vector of things that take a handful of values, or that you want to split on a value, you can write an ifelse statement that works on the whole vector at once. So for example:

## assume ts is a POSIXlt object from January 1, 2008 through January 1, 2009
saturdays <- ifelse(ts$wday==6,saturdayvalue,otherdayvalue)

This isn’t a great example, because you can accomplish the same thing by indexing. There are cases when ifelse is much more useful than indexing.
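
For completeness, the indexing version of that same Saturday example would be something like this (same made-up ts, saturdayvalue, and otherdayvalue as above):

## indexing instead of ifelse: start with the default, then overwrite the Saturdays
saturdays <- rep(otherdayvalue, length(ts$wday))
saturdays[ts$wday == 6] <- saturdayvalue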

However, I made the mistake of thinking that ifelse knows what its target is (what context it is operating in), but the function only looks at its arguments, not its expected result. So I did something like:

> testCondition <- 2
> list.of.data <- 1:10

> list.of.data
[1]  1  2  3  4  5  6  7  8  9 10
> target.list <- ifelse(testCondition==1,list.of.data,list.of.data*2)

I expected target.list to be a list of 10 items, either doubled or not, depending on the value of testCondition. In fact, I just got one item

> target.list
[1] 2

But, if you make a list condition, rather than a scalar, you get a list result. So consider this

> ## ifelse needs a list condition to generate a list output
> testCondition <- rep(1:3,5)
> testCondition
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> target.list <- ifelse(testCondition==1,list.of.data,rev(list.of.data))
> target.list
[1]  1  9  8  4  6  5  7  3  2 10 10  9  3  7  6

Which is an utterly crazy result. Even now, when I thought I was writing an example of what ifelse does, I was expecting a list of lists, with reversed lists wherever the condition was false. But no, ifelse generates as its result the same shape that it has for its condition: a simple vector. Looking again at the code, what that ifelse command did is this: every time the testCondition vector hit 1, it drew from the normally ordered list, and every time it hit 2 or 3, it drew from the reversed list.of.data variable, keeping the index into each source vector in step with the index of the output. So the output gets element 1 from the forward list, elements 2 and 3 from the reversed list, element 4 from the forward list, and so on, recycling both lists when you hit element 11 of the output vector.

> rev(list.of.data)
[1] 10  9  8  7  6  5  4  3  2  1
> list.of.data
[1]  1  2  3  4  5  6  7  8  9 10
> testCondition==1
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[13]  TRUE FALSE FALSE
> target.list
[1]  1  9  8  4  6  5  7  3  2 10 10  9  3  7  6
> 

So the shape of the result is based entirely on the test, not on the expected result or on the two true and false alternatives. If you have a single scalar for your test, then you’ll get a scalar for your answer. If you have a list of 15 elements for your test, you’ll get a list of 15 elements for your answer, with the two alternatives being recycled as and if necessary. And you can generate truly weird results if you don’t think about what you’re doing.
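
And for the scalar test that tripped me up in the first place, the right tool is a plain if/else, which evaluates the condition once and hands back whichever whole vector was asked for:

## scalar condition: use if/else, not ifelse, to get the entire vector back
target.list <- if(testCondition == 1) list.of.data else list.of.data * 2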

R. Struggle with it and it becomes clear.

Been using R almost exclusively for the past few weeks. I’ve always liked R, but I find the syntax and style maddeningly slow to ingest. Perhaps everybody is like this, but I’ve found that some programming language idioms I take to pretty readily (JavaScript and Perl), some I hate (Java before generics and Spring IOC was odious; after, it is at least tolerable), and others I just have to fight through a few weeks of doing things utterly wrong.

R falls in that last camp, but since I used to be pretty good at it back when I was working on my dissertation, I’ve always considered it my go-to stats language. So now that I have a major deliverable due, one that needs more advanced statistics than the usual “mean/max/min/sd” one can throw at data, I’ve taken the plunge back into R syntax once again.

I’m building up scripts to process massive amounts of data (massive to me, perhaps not to Google and Yahoo, but a terabyte is still a terabyte), so each step of these scripts has to be fast. Periodically, though, I come across some step that is just too slow, or something that used to be fast but bogs down as I add more cruft and throw more data at it.

Here is an example of how R continues to confound me even after 3 weeks of R R R (I’m a pirate, watch me R). Continue reading

RJSONIO to process CouchDB output

I have an idea. I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R via RCurl and RJSONIO (from http://www.omegahat.org/). So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, use RJSONIO to parse the JSON, and then use bootstrap methods to estimate expected values and confidence bounds, and, perhaps more importantly, try to estimate outliers and unusual events. Continue reading