Way too much cool stuff going on

So I finally spent an hour playing with node.js.

I did some hacking with node.js and my CouchDB application for processing traffic detector information. I have 12 databases, each holding one month of data. For technical reasons, it is not workable to merge those 12 databases into one, and anyway in the end I’ll have 144 per year of data (one per month for each Caltrans district, if all goes well). So it makes sense to start sharding things up now in a rational manner. A month of data gets processed pretty quickly by my view generator.

The problem comes in with querying the views and merging the results. Ordinarily, one would want to ask something like “what is the average volume and occupancy of the traffic at this node every five minutes from 8am to 12am on a weekday over the last 12 months?” To answer that question, I need to merge the output of 12 different views. While each view response comes up lickety-split because CouchDB is awesome like that, I either have to put the merging burden on the client or come up with a merging solution on my own.

Enter node.js. The cool thing about node.js is that it uses JavaScript. I’m used to event-based programming in js for web apps, but I don’t ordinarily write threaded or event-based programs on the server side (though I should, I know). I’ve often looked at the different event approaches for Perl, and I know it is possible to write threaded Java, but I’ve never really done anything with either. My usual solution is to write multiple parallel jobs and spawn them all, which actually works pretty well as long as you keep an eye on top.

With node.js, however, my brain is doing the right thing. I just get it. I read the documentation for the node CouchDB library I found on GitHub, and saw that it has one event queue per CouchDB client. So I wired up 12 clients, connected each to one of my 12 databases, and wrote a loop to fire off 12 requests, each with a callback that updated a global hash.
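The fan-out pattern looks roughly like this. It's just a sketch: `queryView` here is a hypothetical stand-in for whatever the CouchDB client library actually exposes (a real version would issue an HTTP GET against each database's view), and the database names are illustrative:

```javascript
// Hypothetical stand-in for a per-database CouchDB view query; a real
// version would do an HTTP GET against /<db>/_design/.../_view/...
function queryView(db, callback) {
  // pretend each database returns rows keyed by time-of-day bucket
  callback(null, [{ key: '08:00', value: { vol: 10, occ: 0.1 } }]);
}

var databases = [];
for (var m = 1; m <= 12; m++) {
  databases.push('detectors_2009_' + m); // illustrative: one db per month
}

var merged = {};                 // global hash accumulating across databases
var pending = databases.length;  // countdown of outstanding callbacks

databases.forEach(function (db) {
  queryView(db, function (err, rows) {
    if (err) throw err;
    rows.forEach(function (row) {
      if (!merged[row.key]) merged[row.key] = { vol: 0, occ: 0, n: 0 };
      merged[row.key].vol += row.value.vol;
      merged[row.key].occ += row.value.occ;
      merged[row.key].n += 1;
    });
    if (--pending === 0) {
      // all 12 callbacks have fired; merged now holds the aggregate
      console.log(merged);
    }
  });
});
```

The countdown variable is the whole trick: since node runs the callbacks on one event loop, there is no locking to worry about when they all poke at the same hash.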

The results were very pretty. The output was properly aggregated, and the response was only a little slower than running a single query against a single month:

real	0m0.320s
user	0m0.067s
sys	0m0.007s


real	0m0.179s
user	0m0.047s
sys	0m0.013s

I suspect most of the extra time was spent sending the larger result set down the wire, rather than computing views or responding to requests.

So this brings me to another problem I’ve been trying to solve. What I really need is not the average traffic flow stats, but the average and the standard deviation. Computing that in a view in a way that is numerically stable is more challenging. While a simple summing-up operation finished in 2 days, computing the standard deviation and average for the 12 months has been running for 3 days and is only about a third of the way there, with the views currently at about 16G.
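For illustration (this is not necessarily the reduce my view uses), one numerically stable approach is to carry a partial summary of (count, mean, M2) — where M2 is the sum of squared deviations from the mean — and merge pairs of partials with the parallel-variance update, which avoids the catastrophic cancellation of the naive sum-of-squares formula:

```javascript
// Merge two partial summaries {n, mean, M2} using the parallel
// variance update (Chan et al.); works for single values and for
// previously merged partials alike.
function combine(a, b) {
  if (a.n === 0) return b;
  if (b.n === 0) return a;
  var n = a.n + b.n;
  var delta = b.mean - a.mean;
  return {
    n: n,
    mean: a.mean + delta * (b.n / n),
    M2: a.M2 + b.M2 + delta * delta * (a.n * b.n / n)
  };
}

function summarize(values) {
  var acc = { n: 0, mean: 0, M2: 0 };
  values.forEach(function (v) {
    acc = combine(acc, { n: 1, mean: v, M2: 0 }); // a single value is a trivial partial
  });
  return acc;
}

var s = summarize([2, 4, 4, 4, 5, 5, 7, 9]);
var stddev = Math.sqrt(s.M2 / s.n); // population standard deviation
```

Because `combine` is associative, the same function works as the reduce, the re-reduce, and the final cross-database merge.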

Perhaps a better way, possible with an external server like node, is to generate the reduce on request. I’ve already computed the map function I’m interested in. All I have to do is not ask for the reduce, and then I can play with the view output on my own. I miss out on caching the results, of course, but on the other hand, I can do this while I wait for the real view to finish, and switch over when it is done.

And that is the beauty of node.js. It is JavaScript, so the map and reduce code that I write in CouchDB can be dropped more or less as-is into node.js. Write a little switch at the top to check whether the view is complete, and if not, get the values from a parallel view (same map, no reduce) and compute the requested reduction on them.
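That switch might look something like the following sketch. The two `fetch*` helpers are hypothetical stand-ins for HTTP calls: the first would hit the reduced view normally, the second would hit it with `?reduce=false` (a real CouchDB query parameter) to get the raw mapped values back:

```javascript
// Hypothetical stand-ins for HTTP requests against the view.
// fetchReduced would GET the view normally; here it fails, simulating
// a view that is still building. fetchMapRows would GET it with
// ?reduce=false and hand back the raw mapped values.
function fetchReduced(cb) { cb(new Error('view still building')); }
function fetchMapRows(cb) { cb(null, [1, 2, 3, 4]); }

// The same reduce function I'd hand to CouchDB, dropped into node as-is.
function reduceFn(keys, values, rereduce) {
  var total = 0;
  values.forEach(function (v) { total += v; });
  return total;
}

function getResult(cb) {
  fetchReduced(function (err, reduced) {
    if (!err) return cb(null, reduced);     // real view is done: use it
    fetchMapRows(function (err2, values) {  // fall back to map-only rows
      if (err2) return cb(err2);
      cb(null, reduceFn(null, values, false)); // reduce in node instead
    });
  });
}

var result;
getResult(function (err, r) { result = r; });
```

Once the real view finishes building, `fetchReduced` starts succeeding and the fallback path simply stops being taken, with no code change.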

Furthermore, since the reduce approach already requires that reduce functions also handle re-reduce, I can run all 12 calls independently of each other in their own callbacks, and then fire off a 13th callback when they are all done to merge the 12 results by calling the same reduce function.
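A sketch of that 13th-callback merge, assuming (for illustration) a reduce that carries {n, sum} partials toward an average:

```javascript
// CouchDB-style reduce over {n, sum} partials. With rereduce=false the
// values are raw mapped numbers; with rereduce=true they are partials
// from earlier reductions, so the same function merges them.
function reduceFn(keys, values, rereduce) {
  var out = { n: 0, sum: 0 };
  values.forEach(function (v) {
    if (rereduce) { out.n += v.n; out.sum += v.sum; }
    else          { out.n += 1;   out.sum += v;     }
  });
  return out;
}

// pretend these came back from the 12 per-month databases
var partials = [];
for (var m = 0; m < 12; m++) {
  partials.push(reduceFn(null, [m, m + 1], false)); // each month reduces its own rows
}

// the "13th callback": rereduce the 12 partials into one result
var grand = reduceFn(null, partials, true);
var average = grand.sum / grand.n;
```

The re-reduce contract is doing all the work here: because CouchDB already demands that the function merge its own outputs, merging across databases is the same operation as merging within one.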

Of course, I haven’t done any of this yet, but I have high hopes.
