R ifelse tripped me up

In the continuing saga of climbing the R learning curve, I just found a bug in older code.

Although in hindsight I can’t see what on earth I was thinking.

I have been using ifelse as sort of a replacement for case statements. ifelse is cool because if you have a list of things that takes a handful of values, or that you want to split on a value, you can write one ifelse call that works on the whole list. So for example:

## assume ts is a POSIXlt object from January 1, 2008 through January 1, 2009
saturdays <- ifelse(ts$wday == 6, saturdayvalue, otherdayvalue)

This isn’t a great example, because you can accomplish the same thing by indexing, but there are cases where ifelse is much more useful than indexing.
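To make the sketch above runnable (the value names and the values themselves are my own stand-ins, and ts here is just a year of daily timestamps), here is one way the ifelse version and the indexing version line up:

```r
## a year of daily timestamps, as POSIXlt so $wday (0 = Sunday ... 6 = Saturday) is available
ts <- as.POSIXlt(seq(as.Date("2008-01-01"), as.Date("2009-01-01"), by = "day"))
saturdayvalue <- 100   # stand-in values, purely for illustration
otherdayvalue <- 1

## ifelse version: one vectorized call over the whole year
saturdays <- ifelse(ts$wday == 6, saturdayvalue, otherdayvalue)

## indexing version: start with the default, then overwrite the Saturdays
saturdays2 <- rep(otherdayvalue, length(ts$wday))
saturdays2[ts$wday == 6] <- saturdayvalue

identical(saturdays, saturdays2)
## [1] TRUE
```

Both produce one value per day, which is exactly what makes ifelse attractive here.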

However, I made the mistake of thinking that ifelse knows what its target is (what context it is operating in), but the function only looks at its arguments, not its expected result. So I did something like:

> testCondition <- 2
> list.of.data <- 1:10

> list.of.data
[1]  1  2  3  4  5  6  7  8  9 10
> target.list <- ifelse(testCondition==1,list.of.data,list.of.data*2)

I expected target.list to be a list of 10 items, either doubled or not, depending on the value of testCondition. In fact, I just got one item:

> target.list
[1] 2
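For a scalar condition like this, what I actually wanted is plain if/else, which evaluates the condition once and returns whichever whole vector you hand it — a minimal sketch:

```r
testCondition <- 2
list.of.data <- 1:10

## if/else returns the chosen branch whole, instead of
## shaping the result element-by-element like ifelse does
target.list <- if (testCondition == 1) list.of.data else list.of.data * 2
target.list
## [1]  2  4  6  8 10 12 14 16 18 20
```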

But if you make the condition a list rather than a scalar, you get a list result. So consider this:

> ## ifelse needs a list condition to generate a list output
> testCondition <- rep(1:3,5)
> testCondition
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> target.list <- ifelse(testCondition==1,list.of.data,rev(list.of.data))
> target.list
[1]  1  9  8  4  6  5  7  3  2 10 10  9  3  7  6

Which is an utterly crazy result. Even now, while writing an example of what ifelse does, I was expecting a list of lists, with reversed lists where the condition was true. But no, ifelse generates a result shaped like its condition…a simple vector. Looking again at the code, every time the testCondition vector hit 1, that ifelse command drew from the normally ordered list, and every time it hit 2 or 3, it drew from the reversed list.of.data, keeping the index into the output vector going. So you start with the forward list at the first element, get elements 2 and 3 from the reversed list, pull element 4 from the forward list at the 4th position, and so on, recycling both lists once you hit element 11 of the output vector.

> rev(list.of.data)
[1] 10  9  8  7  6  5  4  3  2  1
> list.of.data
[1]  1  2  3  4  5  6  7  8  9 10
> testCondition==1
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[13]  TRUE FALSE FALSE
> target.list
[1]  1  9  8  4  6  5  7  3  2 10 10  9  3  7  6
> 
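The element-by-element selection described above can be reproduced by hand. This is a sketch of what ifelse is effectively doing (the intermediate names yes, no, and out are mine):

```r
list.of.data <- 1:10
testCondition <- rep(1:3, 5)
test <- testCondition == 1

## recycle both answer vectors out to the length of the test...
yes <- rep(list.of.data, length.out = length(test))
no  <- rep(rev(list.of.data), length.out = length(test))

## ...then pick from yes where the test is TRUE, from no where it is FALSE
out <- no
out[test] <- yes[test]
out
##  [1]  1  9  8  4  6  5  7  3  2 10 10  9  3  7  6
```

That matches the transcript above, and matches ifelse(test, list.of.data, rev(list.of.data)) exactly.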

So the result is shaped entirely by the test value, not by the expected result or by the two true/false answers. If you have a single scalar for your test, then you’ll get a scalar for your answer. If you have a list of 15 elements for your test, you’ll get a list of 15 elements for your answer, with the two answer vectors being recycled as and if necessary. And you can generate truly weird results if you don’t think about what you’re doing.
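A minimal illustration of that recycling rule (the values here are mine): a 5-element test against a 2-element yes answer and a scalar no answer.

```r
## the 5-element test fixes the output length; the yes vector
## recycles to 1 2 1 2 1 and the scalar no fills every FALSE slot
ifelse(c(TRUE, FALSE, TRUE, FALSE, TRUE), 1:2, 100)
## [1]   1 100   1 100   1
```

Note that yes is only ever read at the TRUE positions of its recycled copy, which is why 2 never appears in the output.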

R. Struggle with it and it becomes clear.

Been using R almost exclusively for the past few weeks. I’ve always liked R, but I find the syntax and style maddeningly slow to ingest. Perhaps everybody is like this, but I’ve found that some programming language idioms I take to pretty readily (JavaScript and Perl), some I hate (Java before generics and Spring IoC was odious; after, it is at least tolerable), and others I just have to fight through a few weeks of doing things utterly wrong.

R falls in that last camp, but since I used to be pretty good at it back when I was working on my dissertation, I’ve always considered it to be my go-to stats language. So now that I have a major deliverable due, and it really needs more advanced statistics than the mean/max/min/sd one can usually throw at data, I’ve taken the plunge back into R syntax once again.

I’m building up scripts to process massive amounts of data (massive to me, perhaps not to Google and Yahoo, but a terabyte is still a terabyte), so each step of these scripts has to be fast. Periodically I come across some step that is just too slow, or something that used to be fast bogs down as I add more cruft and throw more data at it.

Here is an example of how R continues to confound me even after 3 weeks of R R R (I’m a pirate, watch me R).