Closer, but still not quite there with bootstrap sampling in CouchDB. My prior post was mostly thinking out loud. I've tried some things since, and this post is an attempt to organize my thoughts on the topic.
The first thing I tried was to submit a list of document ids to a view and see what happened. This might work, and it might not; it certainly won't work the way I expected. That is, I have to use a very flat view: the map has to emit doc._id as the key and the observation as the value, and the reduce has to compute the statistic of interest over all of the input values. I haven't tried this yet, but my guess is that CouchDB will simply re-run the reduce over all of the requested values, so no time is saved by CouchDB's caching of views.
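To make that concrete, here is a minimal sketch of the flat view I have in mind, written as plain JavaScript functions rather than design-document fields (in CouchDB the map takes only doc, with emit provided globally; I pass emit in so the functions can be exercised outside the database). I'm assuming each document stores its observation in doc.value, and I'm using the mean as a stand-in for the statistic of interest; the rereduce branch is there because CouchDB combines partial reduce results.

```javascript
// Map: one row per document, keyed by _id. In a real design document
// this would be function (doc) { ... } with emit as a global.
var map = function (doc, emit) {
  emit(doc._id, doc.value);
};

// Reduce: compute the statistic (here, the mean) over the input
// values; on rereduce, combine the partial {n, sum} results.
var reduce = function (keys, values, rereduce) {
  if (rereduce) {
    var n = 0, sum = 0;
    values.forEach(function (v) { n += v.n; sum += v.sum; });
    return { n: n, sum: sum, mean: sum / n };
  }
  var total = values.reduce(function (a, b) { return a + b; }, 0);
  return { n: values.length, sum: total, mean: total / values.length };
};
```

The point of carrying n and sum alongside the mean is that the mean alone cannot be combined correctly across rereduce calls.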
Another approach is to put a random sequence in the view and sample from that. The problem there is that I need to recompute the view every time. Using external programs, I would have to query the db for the list of doc ids, sample those with replacement to build my bootstrap sample, then create a view and submit it to the one-off view processor. Given that such a view can't be cached anyway, the performance hit for this approach will always be paid, so it is no big deal not to have a cached view. Still, it would be nice not to have to rewrite the view every time I want to use it just because the database has grown.
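The sampling-with-replacement step of that external program is simple enough to sketch. bootstrapKeys is a hypothetical helper name, not anything CouchDB provides; it takes the list of doc ids fetched from the database and builds the {"keys": [...]} body that a view accepts when queried by POST. One thing I would have to check is whether CouchDB returns a row per requested key when the keys list contains duplicates, since duplicates are the whole point of sampling with replacement.

```javascript
// Hypothetical helper for the external-program approach: draw a
// bootstrap sample (with replacement) from the full list of doc ids
// and wrap it as the body of a POST query against a view.
function bootstrapKeys(docIds) {
  var sample = [];
  for (var i = 0; i < docIds.length; i++) {
    // Math.random() is fine for a sketch; a seeded RNG would make
    // the resamples reproducible.
    sample.push(docIds[Math.floor(Math.random() * docIds.length)]);
  }
  return { keys: sample }; // JSON.stringify this for the request body
}
```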
Another approach that I am thinking about now is to save my data differently. Instead of saving the output of each observation's computations as its own document, I could collect those results into an entire day's worth of data and stuff the db with that. Unfortunately, I'll also have to rewrite my Java code, as at the moment I am grabbing a few hours of time across all detectors. Instead I'll need to grab a day's data across a single detector.
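For that restructured layout, I am picturing documents shaped roughly like this; the id scheme and field names are placeholders, not a settled schema:

```javascript
// Rough shape of a per-detector, per-day document. One document
// holds a whole day's observations, so a bootstrap sample can draw
// whole days rather than individual observations.
var dayDoc = {
  _id: "det42:2009-07-15",   // made-up scheme: detector id plus date
  detector: "det42",
  date: "2009-07-15",
  observations: []           // the day's computed values, in time order
};
```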
Which brings up another topic I am still unsure about. Most of the bootstrap references talk about sampling n times from an original sample of size n. That is, if there are 1,000 observations, each bootstrap sample has 1,000 observations. There is some discussion in Chernick's book, around p. 178 or so, about using m-out-of-n sampling, that is, drawing a sample smaller than n. The rule is pretty vague: m should go to infinity as n does, but at a slower rate, so that m/n goes to zero. That is really broad, and I need to get a better source and/or try it out for myself. Anyway, it seems like log(n) would fit this rule, but would give a terribly small sample.
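To see just how small log(n) is, compare it with another rate that also satisfies the rule; sqrt(n) is just a hypothetical alternative here, not something I have seen recommended:

```javascript
// Two candidate rates for m in an m-out-of-n bootstrap. Both go to
// infinity while m/n goes to zero, but log(n) is tiny in practice.
function candidateSizes(n) {
  return {
    logN: Math.round(Math.log(n)),  // natural log
    sqrtN: Math.round(Math.sqrt(n))
  };
}
```

For n = 1,000 that gives m around 7 for log(n) versus 32 for sqrt(n), and even at n = 1,000,000, log(n) is only about 14.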
The point of using an m-out-of-n sample is to reduce the impact of outliers or a fat tail. I do have outliers in my data, so it makes sense to try it. I guess the best solution is to test it against, say, the balanced sampling approach (concatenate b copies of the n observations, randomly permute them, and cut the result into b samples, so that each observation appears exactly b times overall), and then inspect the differences in the resulting bias and variance for both estimates.
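To keep myself honest about what "balanced" means there, a sketch of my understanding of the procedure:

```javascript
// Balanced bootstrap sketch: concatenate b copies of the data,
// shuffle the pool, and cut it into b resamples of size n, so every
// observation appears exactly b times across all resamples.
function balancedBootstrap(data, b) {
  var pool = [];
  for (var i = 0; i < b; i++) pool = pool.concat(data);
  // Fisher-Yates shuffle of the pooled copies
  for (var k = pool.length - 1; k > 0; k--) {
    var j = Math.floor(Math.random() * (k + 1));
    var tmp = pool[k]; pool[k] = pool[j]; pool[j] = tmp;
  }
  var samples = [];
  for (var s = 0; s < b; s++) {
    samples.push(pool.slice(s * data.length, (s + 1) * data.length));
  }
  return samples;
}
```

Unlike plain resampling with replacement, this guarantees the overall frequency of each observation, which is what makes the bias comparison between the two approaches meaningful.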