R. Struggle with it and it becomes clear.

Been using R almost exclusively for the past few weeks. I’ve always liked R, but I find that the syntax and style maddeningly slow to ingest. Perhaps everybody is like this, but I’ve found that some programming language idioms I take to pretty readily (JavaScript and Perl), some I hate (Java before generics and Spring IOC was odious, after it is at least tolerable), and others I just have to fight through a few weeks of doing things utterly wrong.

R falls in that last camp, but since I used to be pretty good at it back when I was working on my dissertation, I’ve always considered it to be my goto stats language. So now that I have a major deliverable due, and it really needs more advanced statistics than the usual “mean/max/min/sd” one can usually throw at data, I’ve taken the plunge back into R syntax once again.

I’m building up scripts to process massive amounts of data (massive to me, perhaps not to Google and Yahoo, but a terabyte is still a terabyte), so each step of these scripts has to be fast. So periodically I come across some step that is just too slow, or something that used to be fast but that slows down as I add more cruft and throw more data at it, it bogs down.

Here is an example of how R continues to confound me even after 3 weeks of R R R (I’m a pirate, watch me R). I was using the time series class zoo, and tried to use its “rollapply” method to apply a method to consecutive triples of observations. My actual task is to filter out jumps in raw data from traffic loops that might have resulted from a pattern I’ve observed where two observations are repeated, and then the third is approximately double the first two. The root cause is a hardware fault, where the polling just fails to get to each sensor, and so the sensor continues to collect for 60 seconds rather than 30, and so its data is roughly double its neighbors, and the missed poll reports exactly the data from the previous 30s period. (Yay Caltrans.)

So I have a year of data, for about 2,000 sensors, and I want a fast routine to roll through triples. Here is my first effort (after finding that the data I’m working with does not exhibit exact duplicates, but still has wild outliers):

look.for.problems <- function(df){
  ## I want three lines at a time                                                                                                                              
  result <- vector(length=lanes)
  for(i in 1:lanes){
     result[i] <-
      (df[3,paste('n',i,sep='')] > 30 ) |
     (df[3,paste('o',i,sep='')] > .05 &
     df[3,paste('n',i,sep='')] > 15  &
     df[3,paste('n',i,sep='')] > 1.9*df[1,paste('n',i,sep='')] &
     df[3,paste('n',i,sep='')] > 1.9*df[2,paste('n',i,sep='')]) |
    (df[3,paste('n',i,sep='')] > 0 & df[3,paste('o',i,sep='')] == 0 ) |
    (df[3,paste('n',i,sep='')] == 0 & df[3,paste('o',i,sep='')] > 0 )

I called that routine as follows:

doubles.df <- matrix(rep(FALSE,2*lanes),ncol=lanes)
for (i in 3:10000){
  doubles.df <- rbind(doubles.df,look.for.problems.2(df.mi.input[(i-2):i,]))

So that loop runs through each row of the data, sending of three lines to the routine, which in turn runs over each “lane” to test whether the third volume and occupancy in the set are insane relative to the first two.

As you might expect, this is slow. For 10,000 rows in the test loop shown above, this routine takes

   user  system elapsed
 77.563   0.106  77.685

Slow with a capital “old age”, considering there are 900,000+ points for just this sensor for the year 2007.

R syntax is all about loop avoidance, so I thought to myself, how can I skip the loop over the lanes in order to speed up that inner loop. Following my trusty approach to thinking in R, I first tried to see how I could do the three operations as one. Here is my second attempt at the inner routine (it is called look.for.problems.3 because version 2 was a tweak of the logical conditions, not an attempt at a speed up, and I’m too lazy to rewrite my code just for this write up)

look.for.problems.3 <- function(df){
  df.n <- df[,seq(2,lanes*2,2)]
  df.o <- df[,seq(3,lanes*2 +1,2)]
  df.n[3,] > 30  |
    (df.o[3,] > .05 &
     df.n[3,] > 15  &
     df.n[3,] > 1.9*df.n[1,] &
     df.n[3,] > 1.9*df.n[2,]) |
    (df.n[3,] > 0 & df.o[3,] == 0 ) |
    (df.n[3,] == 0 & df.o[3,] > 0 )

Here I break the data frame df up into two parts, one for volume, and the other for occupancy. This simplifies things because in R you can say things like df.n[3,] > 30 and get as many answers as there are elements of df.n[3,] (in this case, there are 6 lanes, so that’s 6 FALSE or TRUE responses). So great, I removed one explicit loop in the inner loop, and maybe R would optimize things and make this version slightly faster.

150 seconds later I could see that I don’t know diddly/squat about R. So, perhaps the copying of the data frame into two data frames was the significant slow down. Instead of splitting it up explicitly, the same thing could be done by just applying the indexes to the original data frame. Furthermore, those indexes are the same for every iteration, so they can be defined outside of the loop. Version 3:

n.idx <- seq(2,lanes*2,2)
o.idx <- seq(3,lanes*2 +1,2)
look.for.problems.4 <- function(df){
  df[3,n.idx] > 30   |
    (df[3,o.idx] > .05 &
     df[3,n.idx] > 15  &
     df[3,n.idx] > 1.9*df[1,n.idx] &
     df[3,n.idx] > 1.9*df[2,n.idx]) |
    (df[3,n.idx] > 0 & df[3,o.idx] == 0 ) |
    (df[3,n.idx] == 0 & df[3,o.idx] > 0 )

So instead of df.n, I now have df[3,n.idx], and so on. Survey says…

doubles.df <- matrix(rep(FALSE,2*lanes),ncol=lanes)
system.time(for (i in 3:10000){
  doubles.df <- rbind(doubles.df,look.for.problems.4(df.mi.input[(i-2):i,]))
##  user  system elapsed
## 158.570   0.020 158.635
## still super slow!! 

Once again, my speedup ended up being a slow up. But at least the delay was only a measly 8 seconds, rather than the doubling my prior speed up caused. Still, I felt I was on the right track, because the R way is to avoid loops and use indexes and direct vector and array operations everywhere. So I sat and thought some more, and realized that there were likely two causes for the my “optimized” routine to be going slower than my “naive” routine. Specifically, I think that the way I wrote the naive routine I was naturally avoiding most of the comparisons, whereas in my vectorized routine, I wasn’t so sure that each comparison wasn’t being applied to every row. I was pretty sure I could get a speedup by using nested ifelse functions, rather than a big boolean assignment statement. Second, I still had that outer loop, and by picking apart the inner loop, I was beginning to see how I could pick apart the outer loop.

My outer loop started three observations in, and then sent off three consecutive observations to the inner routine. Staring at the code for the inner routine, it seemed to me that all I needed to do was set up three consecutive indices, each one less than the prior, and then that inner loop could be applied directly to the data, thus eliminating the outer loop. My code was as follows:

n.idx <- seq(2,lanes*2,2)
o.idx <- seq(3,lanes*2 +1,2)
look.for.problems.5 <- function(df){
  idx.3 <- 3:length(df[,1])
  idx.2 <- idx.3-1
  idx.1 <- idx.2-1
  problems  <- ifelse(df[idx.3,n.idx] > 30,TRUE,
         ## problem2  
         ifelse((df[idx.3,o.idx] > .04 &
                 df[idx.3,n.idx] > 15  &
                 df[idx.3,n.idx] > 1.9*df[idx.1,n.idx] &
                 df[idx.3,o.idx] > 2*df[idx.1,o.idx] &
                 df[idx.3,n.idx] > 1.9*df[idx.2,n.idx] &
                 df[idx.3,o.idx] > 2*df[idx.2,o.idx])
                ## flaky data conditions
                ifelse((df[idx.3,n.idx] > 0 & df[idx.3,o.idx] == 0 ) |
                       (df[idx.3,n.idx] == 0 & df[idx.3,o.idx] > 0 ), TRUE,

I had to change my calling routine to be:

doubles.df <- matrix(rep(FALSE,2*lanes),ncol=lanes)
   doubles.df <- rbind(doubles.df,look.for.problems.5(df.mi.input))

Et voila the runtime:

##  user  system elapsed
## 0.277   0.000   0.276
## that is really fast!

So. Lesson learned, painfully after a lot of struggle. Avoid loops, and your code will run more than 250 times faster. Which is great. Now running a year of data for one sensor takes 30 seconds on my workstation, which is good because all this step is supposed to do is get rid of really obvious outliers.


One thought on “R. Struggle with it and it becomes clear.

  1. Pingback: RJSONIO doesn’t handle NA, but gsub does « Contour Line

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s