Just a quick post so that I remember to elaborate on this later. I have found that whenever I have a large project to do in CouchDB I go through several iterations of designing the documents and the views.
My latest project is typical.
- First design was to push in really big documents. The idea was to run map/reduce, copy the reduce output to a second db, and map/reduce that for the final result. But view generation was too slow, I never got around to designing the second db, and the biggest documents triggered a bug/memory issue.
- The second design was to push in really small documents. I had an insight, and the small documents were better designed than the rows of the first version’s megadocuments. But I was generating over 90 million docs spread over four databases. The view took even longer to generate than the first go around, and CouchDB complained that my reduce wasn’t reducing fast enough. I turned off that warning and soldiered on, but gave up after 24 hours and 1% view generation.
- The third design collected 100 documents from design 2 into a single doc, and used essentially the same map/reduce (accounting for the slight difference in document design). The idea was to cut the marshalling of docs into and out of JSON by a factor of 100. Still had the “not reducing fast enough” warning.
- The fourth design switched to 50 docs rather than 100, because bulk uploading was crashing CouchDB repeatedly with the larger doc size. I also figured out that I did not need the reduce step, as my client code can handle aggregating 4 or 5 documents per key without issue.
- The fifth design changed the id of the doc to match the sha256 hash of the data, so that I wouldn’t duplicate data uploads.
- The final design made sure that the timestamps of the original data were sorted prior to generating the 50-observation documents, so that I was guaranteed to always get the same output from the same input raw file, and therefore generate the same hash key/doc id, and therefore not upload data twice. I also made some attempts at optimizing the view JS.
A few of those iterations were done on really small databases with just a few documents, but some of the problems only cropped up when my full set of data was being processed. In the end, however, my data is now stored and ready to read, and it loads much faster than the equivalent data in PostgreSQL.
Finally, contrary to my old idea that a view is like a SQL query, I now think of it as a far more expressive version of a SQL index. Like indexes, very simple views that generate a sorting of the data are the most generally useful; but unlike a PostgreSQL index, it is possible to write exactly what you need for an exact query, and make it run super duper fast.
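To make the index analogy concrete, here is a minimal map function of the simple sorting kind I mean; the document shape (50-row docs of `{timestamp, value}` observations) is my illustration, not my actual schema. In a design document this would be the string value of the view’s `map` field.

```javascript
// A CouchDB map function runs once per document and emits (key, value)
// pairs into a persistent index kept sorted by key.
function mapByDay(doc) {
  doc.rows.forEach(function (row) {
    // Compound key [day, timestamp]: because CouchDB sorts emitted keys,
    // a startkey/endkey range query answers "all readings for a day"
    // directly, like a composite SQL index built by arbitrary JS.
    emit([row.timestamp - (row.timestamp % 86400), row.timestamp], row.value);
  });
}
```

With a map-only view like this there is no reduce to run slowly: the client queries a key range and, as in the fourth design above, aggregates the handful of rows per key itself.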