RJSONIO to process CouchDB output

I have an idea.  I am going to process the 5-minute aggregates of raw detector data I’ve stored in monthly CouchDB databases using R, via RCurl and RJSONIO (from http://www.omegahat.org/).  So even though my data is physically split into months, I can use RCurl to pull from each of the databases, use RJSONIO to parse the JSON, then use bootstrap methods to estimate the expected value and confidence bounds and, perhaps more importantly, try to identify outliers and unusual events.

Update: this works great, except it reveals that my JSON structure in CouchDB isn’t so great.  The problem is that I’m dumping JSON objects per line.  For example:

 ["1201044", "00:00:00", "Fri", "12"]:{N:8,O:0.001782, Pct:1, lanes: 5, intrvls: 10}
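For reference, what CouchDB actually sends back from a grouped view is a wrapper object with a rows array, each row holding that key/value pair as proper JSON (shown here with the values from the example above):

```json
{"rows": [
  {"key": ["1201044", "00:00:00", "Fri", "12"],
   "value": {"N": 8, "O": 0.001782, "Pct": 1, "lanes": 5, "intrvls": 10}}
]}
```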

While that looks great on paper, and logically makes sense if you think about pulling a single record, it doesn’t work so well when you process lots of records. While RJSONIO is pretty darn good, it certainly isn’t a mind reader, and it cannot turn a list of such objects into a matrix or data frame without some help. If you just throw the results of the RCurl fetch at RJSONIO, you get the following:

> demo = fromJSON(data)
> demo$rows[1]
[[1]]
[[1]]$key
[1] "1202024"  "17:35:00" "Fri"      "12"

[[1]]$value
[[1]]$value$N
[1] 427

[[1]]$value$O
[1] 0.04861833

[[1]]$value$Pct
[1] 1

[[1]]$value$lanes
[1] 6

[[1]]$value$intrvls
[1] 10


In words, what that means is that the CouchDB response of {rows:[...]} is parsed by R as a named list.  The result is a list with one element, rows, which contains n elements, each holding an element key (a character vector of the key terms) and an element value (itself a list with the named elements N, O, Pct, lanes, and intrvls). I couldn’t figure out a quick way to make R understand that I wanted a data.frame with a named column for each of the key terms and each of the value terms (9 columns by n rows). Many more gray hairs later, I remembered about unlist and got things sorted. Here is my suboptimal R script for the next time I take a long break from using R and can’t remember the syntax anymore.

## parameters: month, id, fivemin
library(RCurl)
library(RJSONIO)
id = 1202024        ## randomly chosen
fivemin = "17:35"   ## example value; the five-minute slot to query
month = sprintf("%02d", 1:12)  ## guessing at my monthly db suffixes; adjust to match
## get every month in parallel.  RCurl is cool that way
couchdb = "http://localhost:5984/"
db = paste("d12_2007_",month,"morehash/_design/summary/_view/fivemin?",sep="")
moreurl = paste("group=true&startkey=[\"",id,"\",\"",fivemin,":00\"]&endkey=[\"",id,"\",\"",fivemin,":01\"]",sep="")
uri = paste(couchdb,db,moreurl,sep="")  ## 12 different URIs to fetch
data = getURL(uri)
## make a list to store data temporarily on the first pass
d1 = list()
for(i in 1:length(data)){
  ## parse each month in turn
  jsondata = fromJSON(data[[i]])
  ## unlist flattens each row's nested key/value lists into a plain vector
  d1[[i]] = lapply(jsondata$rows, unlist)
}
## make the list of flattened R objects into a matrix
## by unlisting again, and specifying that I'm expecting 9 columns
dmatrix = matrix(data=unlist(d1),ncol=9,byrow=TRUE)
## finally, make a dataframe, explicitly labeling each column as needed
## and converting to numeric from text
d2 = data.frame(id=dmatrix[,1],
                time=dmatrix[,2],
                day=dmatrix[,3],
                district=dmatrix[,4],  ## guessing the fourth key term is the district
                N=as.numeric(dmatrix[,5]),
                O=as.numeric(dmatrix[,6]),
                Pct=as.numeric(dmatrix[,7]),
                lanes=as.numeric(dmatrix[,8]),
                intrvls=as.numeric(dmatrix[,9]))
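The key move in the script is unlist, which recursively flattens one parsed row into a single plain vector, coercing the numbers to character along the way (hence the as.numeric conversions). On a single row shaped like the console output above:

```r
## one row of the parsed view output, as fromJSON hands it back
row = list(key   = list("1202024", "17:35:00", "Fri", "12"),
           value = list(N = 427, O = 0.04861833, Pct = 1,
                        lanes = 6, intrvls = 10))
## flatten the nested key/value lists into one vector
flat = unlist(row)
length(flat)       ## 9
flat[["value.N"]]  ## "427" -- character, since the key terms force coercion
```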

Next up is the actual bootstrapping of interesting statistics.
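As a teaser, here is a minimal percentile-bootstrap sketch of the sort of thing I mean, assuming a vector of 5-minute occupancies like d2$O from the script above (faked here with random numbers so the snippet stands alone):

```r
set.seed(42)
O = runif(1000, 0, 0.2)  ## stand-in for d2$O, the 5-minute occupancies
B = 1000                 ## number of bootstrap resamples
## resample with replacement and take the mean of each resample
boot.means = replicate(B, mean(sample(O, replace = TRUE)))
mean(boot.means)                       ## bootstrap estimate of the expected value
quantile(boot.means, c(0.025, 0.975))  ## 95% percentile confidence bounds
```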
