Millions of databases

One of the problems I’ve been having with couchdb is the large number of documents I need to store. My solution was to split up the databases along the time dimension and along districts, so that I have one database per (month * year * district). That makes things manageable for my head, but is less than trivial for a program to sort out.

On the other hand, most of my uses involve putting back together those 12 months in the year. So that suggests that this sharding approach is a bad one. I played around with merging up 12 months into one database, but because the view trees aren’t merged but rather have to be rebuilt, that didn’t make any sense at all, as I posted last week. Using node.js to pre-compute the views that aren’t generated yet works and is lots of fun, but not something I want to maintain beyond an interesting test of node.js.

Then today while browsing the CouchDB users mailing list, J Chris Anderson posted that “CouchDB has been tested with millions of databases on a single server, no problem, so this model is practical and supported.”

Click went my low-wattage brain, and I realized that I could be generating one database per sensor, rather than merging all the sensors into a single database. There are only a few thousand detectors at most (a lot less than millions) and this would help to reduce the size of each database such that more complicated views would become bearable. Even if I combine sensors into related groups (mainline, ramp, HOV, etc), that still would produce a much smaller database.

And the fit is better with our usage patterns. We often want to drill down or produce aggregates for a particular sensor location or even a particular lane, but it is only very rarely that we want to compare data across sensors. And when we do, it is usually something best handled in R rather than JavaScript.