So I finally spent an hour playing with node.js.
I did some hacking with node.js and my CouchDB application for processing traffic detector information. I have 12 databases, each holding one month of data. For technical reasons, it is not workable to merge those 12 databases into one, and anyway in the end I’ll have 144 (one for each Caltrans district if all goes well) for each year of data. So it makes sense to start sharding things up now in a rational manner. A month of data gets processed pretty quickly by my view generator.
The problem comes in with querying the view and merging the results. Ordinarily, one would want to ask something like “what is the average volume and occupancy of the traffic at this node every five minutes from 8am to 12am on a weekday over the last 12 months?” To answer that question, I need to merge the output of 12 different views. While each view response comes up lickety-split because CouchDB is awesome like that, I either have to put the merging burden on the client or come up with a merging solution on my own.
With node.js, however, my brain is doing the right thing. I just get it. I read the documentation for the node CouchDB library I found on GitHub, and saw that it had one event queue per CouchDB client. So I wired up 12 clients, connected each to one of my 12 databases, and wrote a loop to fire off 12 requests, each with a callback that updated a global hash.
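The fan-out pattern looks roughly like the sketch below. The actual CouchDB calls are stubbed out here — queryDatabase stands in for the real client library's view request, and the row shape is invented for illustration — but the shared-totals-plus-countdown structure is the point: each callback folds its rows into one object, and the last one to finish reports the merged result.

```javascript
// Stand-in for the couchdb client's view call; pretend each of the
// 12 monthly databases returns a single aggregate row.
function queryDatabase(month, callback) {
  process.nextTick(function () {
    callback(null, { key: 'node42', volume: 100 + month, occupancy: 0.05 });
  });
}

// Fire one request per month; each callback updates the shared totals,
// and a countdown detects when all of them have returned.
function fanOut(months, done) {
  var totals = { volume: 0, occupancy: 0, count: 0 };
  var pending = months.length;
  months.forEach(function (month) {
    queryDatabase(month, function (err, row) {
      if (err) throw err;
      totals.volume += row.volume;
      totals.occupancy += row.occupancy;
      totals.count += 1;
      if (--pending === 0) done(totals);
    });
  });
}

fanOut([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], function (totals) {
  console.log('average volume:', totals.volume / totals.count);
});
```

Since node is single-threaded, the callbacks never run concurrently, so updating the shared totals object needs no locking.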
The results were very pretty. The output was properly aggregated, and the response was only a little slower than running a single query for a single month:
real 0m0.320s user 0m0.067s sys 0m0.007s
real 0m0.179s user 0m0.047s sys 0m0.013s
I suspect most of the time was spent sending the larger data set down, rather than much time computing views or responding to requests.
So this brings me to another problem I’ve been trying to solve. What I really need are not just the average traffic flow stats, but the average and the standard deviation. But computing that in a view in a way that is numerically stable is more challenging. While a simple summing operation finished in 2 days, computing the standard deviation and average for the 12 months has been running for 3 days and is only about a third of the way there, with the views currently hitting about 16GB.
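For the curious, the numerically stable form I mean is the one-pass update usually attributed to Welford: instead of summing x and x² (which loses precision when the mean is large relative to the variance), the accumulator carries the running mean and the sum of squared deviations. A sketch, with names of my own choosing:

```javascript
// Welford's online algorithm: a numerically stable running mean and
// variance, the kind of accumulator a reduce function would carry per key.
function makeAccumulator() {
  return { count: 0, mean: 0, m2: 0 };  // m2 = sum of squared deviations
}

function push(acc, x) {
  acc.count += 1;
  var delta = x - acc.mean;
  acc.mean += delta / acc.count;
  acc.m2 += delta * (x - acc.mean);  // second factor uses the updated mean
}

function stddev(acc) {
  // sample standard deviation
  return acc.count > 1 ? Math.sqrt(acc.m2 / (acc.count - 1)) : 0;
}

var acc = makeAccumulator();
[10, 12, 23, 23, 16, 23, 21, 16].forEach(function (x) { push(acc, x); });
console.log(acc.mean, stddev(acc));
```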
Perhaps a better way, one made possible by an external server like node, is to generate the reduce function on request. I’ve already computed the map function I’m interested in. All I have to do is not ask for the reduce, and then I can play with the views on my own. I miss out on caching the results, of course, but on the other hand, I can do this while I wait for the real view to finish, and switch over when it is done.
Furthermore, since the reduce approach already requires that reduce can also handle a rereduce, I can run all 12 calls independently of each other in their own callbacks, and then fire off a 13th callback when they are all done to merge the 12 results by calling the same reduce function.
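That 13th callback might look like the sketch below. Each month’s view hands back a partial aggregate of {count, mean, m2}, and one combine function — the same logic a rereduce would use — folds the partials together pairwise. The merge formula is the standard one for combining variances computed in parallel; the summarize helper here is just a stand-in for what one month’s map/reduce would emit.

```javascript
// Merge two partial aggregates of the form {count, mean, m2},
// where m2 is the sum of squared deviations from that partial's mean.
function combine(a, b) {
  if (a.count === 0) return b;
  if (b.count === 0) return a;
  var count = a.count + b.count;
  var delta = b.mean - a.mean;
  return {
    count: count,
    mean: a.mean + delta * (b.count / count),
    m2: a.m2 + b.m2 + delta * delta * (a.count * b.count / count)
  };
}

// Stand-in for one month's view output: a flat aggregate of its values.
function summarize(values) {
  var count = values.length;
  var mean = values.reduce(function (s, x) { return s + x; }, 0) / count;
  var m2 = values.reduce(function (s, x) { return s + (x - mean) * (x - mean); }, 0);
  return { count: count, mean: mean, m2: m2 };
}

// Merging two "months" gives the same answer as one flat pass over both:
var merged = combine(summarize([10, 12, 23]), summarize([23, 16, 23, 21, 16]));
console.log(merged.mean, Math.sqrt(merged.m2 / (merged.count - 1)));
```

Because combine is associative, the 12 partials can be folded in any order as each callback completes, which fits the event-driven style nicely.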
Of course, I haven’t done any of this yet, but I have high hopes.