Bootstrap in a view

Inspired by this post, I am playing around with implementing bootstrap estimates of various statistics as a view in CouchDB.  I am not a statistician, so my definition should not be taken as gospel, but bootstrapping is a statistical method where one repeatedly samples, with replacement, from an observed set of data in order to estimate some statistic and its confidence bounds, such as the mean or the median.  Most of the older sources I've read talk about using it for small to medium sized data sets, where the k bootstrap samples are all of size n, the size of the original data set.  But I can't do that, because my input data is too big, so I have to pick a smaller n.  I'm going with 1,000 for starters, and I'll repeat the draw 10,000 times.

(There’s probably a secondary bootstrap I can do there to decide on the optimal size of the bootstrap sample, but I’m not going to dive into that yet.)
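
Just to pin down what I mean, here is a minimal sketch of the resampling scheme in plain JavaScript, outside CouchDB entirely (the names are only illustrative): draw k samples of size n with replacement and keep the mean of each, so the spread of those means gives the confidence bounds.

// Minimal sketch of the resampling scheme, outside CouchDB.
// Draws k bootstrap samples of size n with replacement and records the mean of each.
function bootstrapMeans(observations, n, k) {
  var means = [];
  for (var i = 0; i < k; i++) {
    var sum = 0;
    for (var j = 0; j < n; j++) {
      // sample with replacement: any observation can be drawn more than once
      sum += observations[Math.floor(Math.random() * observations.length)];
    }
    means.push(sum / n);
  }
  return means.sort(function (a, b) { return a - b; });
}

// e.g. rough 95% bounds: means[Math.floor(0.025 * k)] and means[Math.floor(0.975 * k)]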

So how do I draw 1,000 random samples/documents from a CouchDB database, and repeat that process 10,000 times, efficiently?  My first inclination was to use the map part of the view, but I suspect that this will fail.  My reading of the docs is that the map must produce consistent output given the same input.  So if the map includes a hard-coded list of samples, great.  But I'm not so sure the map function is allowed to produce a random list on invocation; then the map would not be consistent between runs, which would violate all kinds of assumptions, I'm sure.
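
If I went the hard-coded route, the map might look something like the sketch below. This covers only a single resample; the ids, the counts, and the value field are placeholders, and the real sampling plan would be generated offline and baked into the function text (with something like a resample index in the key if I wanted all 10,000 draws in one view).

function (doc) {
  // hard-coded sampling plan, generated offline: doc id -> number of times drawn
  var sampleCounts = { "doc-0001": 2, "doc-0042": 1 /* ... about 1,000 draws ... */ };
  var count = sampleCounts[doc._id];
  if (count) {
    for (var i = 0; i < count; i++) {
      emit(doc._id, doc.value); // emit once per time the doc was drawn
    }
  }
}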

The next approach would be to put all the data into the view, but pull a random subsample with replacement when querying it.  In this case, all of the data is used to generate the statistic in question (mean, median, whatever), plus confidence bounds, and the query does the sampling with replacement.

To do this, my query program (it'd have to be programmatic) would first ping the db and get the doc ids, then randomly draw from those with replacement k times.  Each combination of ids would then get sent to the db using the POST {"keys": ["key1", "key2", ...]} API.
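
A sketch of what that query program might look like, using Node's fetch; the db URL and the design-doc/view names are placeholders, and it assumes a view that emits the doc id as key (see the update below) with the quantity of interest as value.  It bypasses the reduce with reduce=false and computes each resample's mean on the client.

const DB_URL = "http://localhost:5984/mydb"; // placeholder

async function queryBootstrapMeans(n, k) {
  // get the full list of doc ids once
  const allDocs = await (await fetch(`${DB_URL}/_all_docs`)).json();
  const ids = allDocs.rows.map((r) => r.id);

  const means = [];
  for (let i = 0; i < k; i++) {
    // draw n ids with replacement; duplicates are intentional
    // (whether the keys POST honors duplicates is the open question below)
    const keys = Array.from({ length: n }, () => ids[Math.floor(Math.random() * ids.length)]);

    const res = await fetch(`${DB_URL}/_design/stats/_view/by_docid?reduce=false`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ keys }),
    });
    const rows = (await res.json()).rows;
    means.push(rows.reduce((sum, r) => sum + r.value, 0) / rows.length);
  }
  return means;
}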

But that might not work, for two reasons.  First, I'm not so sure that the POST will allow duplicate keys, although I guess this is easy enough to test.  Second, this still requires the complete computation of the map/reduce view over the entire data set.  I already know this takes a long time and uses a lot of space, so I don't really want to do that if I don't have to.  Still, this approach seems pretty good.

The last option I'm thinking about is to do the sampling in the reduce function, not in the map.  There was a response to one of my questions on the mailing list that if a view is byte-identical to another view, then it is only computed once.  So with that in mind, I could write multiple reduce functions paired to identical map functions that subsampled the available docs in different ways.  But I think this will just turn out to be slower in the end than putting the hard-coded sampling scheme into the map function (option 1).  The reason I say this is that the map is compiled with every application, whereas the view is just compiled once, so parsing and loading a long hard-coded list of samples to keep (with possible multiples) is going to take more time because it is done multiple times.  And I'd need to either keep 10,000 different reduce functions, all with a unique sampling list, or else generate one monster reduce function that simultaneously computes output on 10,000 possible samples.

I picture the input set for the reduce job getting sent down each pipe in the matrix, and either being dropped, processed once, or processed multiple times on its way out the other end, with the final reduce step producing the final 10,000 statistics and confidence bounds.  I'm thinking of a bit-mask index matrix like in R, but maybe it would be more efficient to make it a hash map, with an integer value representing the number of times to include the doc (because resampling means there can be multiples and all that).
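
For concreteness, here's a sketch of what one of those reduce functions might look like for a single resampling plan, using the hash-map-of-counts idea.  The ids are placeholders; the real plan would be generated offline and baked into the function text.  It returns a [sum, count] pair so the resample mean falls out at the top level.

function (keys, values, rereduce) {
  var sum = 0, count = 0, i;
  if (rereduce) {
    // combine partial [sum, count] pairs from lower reduce levels
    for (i = 0; i < values.length; i++) {
      sum += values[i][0];
      count += values[i][1];
    }
  } else {
    // one hard-coded resampling plan, generated offline: doc id -> times drawn
    var sampleCounts = { "doc-0001": 2, "doc-0042": 1 };
    for (i = 0; i < keys.length; i++) {
      var times = sampleCounts[keys[i][1]] || 0; // keys[i][1] is the doc _id
      sum += times * values[i];
      count += times;
    }
  }
  return [sum, count]; // resample mean = sum / count
}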

The final option that I am *not* going to consider is to use a temporary view.  From the CouchDB wiki: "Temporary views are not stored in the database, but rather executed on demand."  At first this sounds pretty good, but I think it is misleading: executed on demand means recomputed from scratch on every query, with nothing cached between runs.  I'd expect that in a proper bootstrap sampling, most of the observations will get sampled at least once, so a temporary view would end up re-mapping essentially the whole data set for every resample.  In that kind of situation, I think the second option (building a simple statistic map/reduce view and then sampling it on query) is best, because the computations are cached.  Assuming, of course, that the map part is non-trivial and the reduce part is relatively fast.

Anyway, plenty to work on this weekend… as always, real tests of these ideas will reveal the truth.

Update:  I just tried some things out, and I'm getting clued in a little bit more.  The keys for the POST are keys emitted by the map/reduce output, *not* the ids of the input documents.  My current map/reduce view stratifies by time of day, with risk prediction as the key.  If I instead want random draws over all observations, my output is going to have to be keyed by the document ids.  That is pretty unworkable, I think, unless I have separate views for each time period of interest and each risk prediction, and emit keys that are the doc id.  Again, I'll have to try it and see.  But option 1 is moving up as the most likely anyway, especially as I just read today about balanced resampling to reduce variance in bootstrap methods in Chernick's 2008 book, Bootstrap Methods.
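
If I do go the separate-views route, each per-period view would be something like the sketch below; the field names (hour, risk, value) and the cutoffs are placeholders for whatever my actual document schema turns out to be.

function (doc) {
  // one view per time period of interest and risk prediction; placeholder conditions
  if (doc.hour === 8 && doc.risk === "high") {
    // keyed by doc id so the keys POST can select individual observations
    emit(doc._id, doc.value);
  }
}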
