Close still doesn’t count …

… except for nukes and bocce.

I can *almost* make bootstrapping work, but not entirely within CouchDB. I am going to have to do external processing. Which is probably fine.

Anyway, here’s where I am so far.  I am loading up one database per detector, with documents that look like:

{
   "_id": "40130160",
   "_rev": "1-1446962830",
   "CV_OCC_1": "0.29665",
   "LAG1_OCC_1": "-0.02416",
   "CORR_OCC_1M X LAG1_OCC_R": "0.07549",
   "Severity--PDO": "0.46376",
   "LAG1_OCC_R": "0.15293",
   "left lane accident": "0.055",
   "2 veh accident": "0.22823",
   "vdsid": "1203692",
   "SD_VOL_M": "3.16066",
   "EstimateTime": "2008-02-24T18:00:00-0800",
   "CORR_VOL_1R": "0.45782",
   "month": "02",
   "1 veh accident": "0.0597",
   "day": "Sun",
   "CV_OCC_R": "0.47489",
   "CORR_VOL_1M": "0.4268",
   "MU_VOL_M": "13.60",
   "CV_OCC_M": "0.26119",
   "LAG1_VOL_M": "-0.07773",
   "CORR_OCC_1R": "0.21826",
   "IntervalSeconds": "1200",
   "MU_VOL_R": "11.625",
   "CORR_VOLOCC_1M": "0.22346",
   "VOL_M": "15.00",
   "LAG1_OCC_M": "-0.00962",
   "VOL_1": "20.00",
   "OCC_1": "0.12778",
   "CV_VOLOCC_1 X CORR_VOLOCC_1M": "0.01314",
   "Severity--Injury": "0.10563",
   "OCC_M": "0.11444",
   "fiveminute": "18:00:00",
   "MU_VOL_1": "13.20",
   "CORR_OCC_1M": "0.4936",
   "CV_VOLOCC_1": "0.05882",
   "SumVol": "1,537.00",
   "CORR_VOLOCC_1R": "0.28207",
   "SD_VOL_R": "3.05243",
   "OCC_R": "0.14",
   "CV_VOLOCC_R": "0.21446",
   "off road accident": "0.0949",
   "CORR_OCC_1M X MU_VOL_M": "6.71293",
   "MuVolocc": "352.40435",
   "LAG1_VOL_R": "-0.14296",
   "VOL_R": "11.00",
   "interior lanes accident": "0.14903",
   "CORR_VOL_MR": "0.41195",
   "CORR_VOLOCC_MR": "0.21033",
   "LAG1_VOL_1": "-0.01163",
   "CORR_OCC_MR": "0.31359",
   "3+ veh accident": "0.09185",
   "CORR_OCC_1M X SD_VOL_R": "1.50667",
   "year": "2008",
   "CV_VOLOCC_M": "0.06651",
   "right lane accident": "0.08891",
   "any accident": "0.38626",
   "SD_VOL_1": "3.59629"
}

Then I have a view with the following map:

function(doc) {
    if (doc.year) {
        // key each row by document id, and coerce the stored string to a number
        var name = "any accident";
        emit(doc._id, doc[name] - 0);
    }
}

and a reduce that is more or less the same as the Knuthian mean and variance that I wrote up in an earlier post. My idea was to do bootstrap sampling by just using the POST {"keys": ["key1", "key2", …]} call documented on the HTTP view API page. But it doesn't work. Or rather, it works, but the API requires group=true. So what I get out is something like:

curl 'http://localhost:5985/safetydb1213891/_design/Any/_view/bytime?group=true' -d '{"keys":["40130190","40130191","40130192","40130193","40130190","40130191","40130192","40130193"]}' 
{"rows":[
{"key":"40130190","value":{"M2":0,"n":1,"mean":0.32849,"min":0.32849,"max":0.32849,"variance_n":0}},
{"key":"40130191","value":{"M2":0,"n":1,"mean":0.31275,"min":0.31275,"max":0.31275,"variance_n":0}},
{"key":"40130192","value":{"M2":0,"n":1,"mean":0.31403,"min":0.31403,"max":0.31403,"variance_n":0}},
{"key":"40130193","value":{"M2":0,"n":1,"mean":0.30753,"min":0.30753,"max":0.30753,"variance_n":0}},
{"key":"40130190","value":{"M2":0,"n":1,"mean":0.32849,"min":0.32849,"max":0.32849,"variance_n":0}},
{"key":"40130191","value":{"M2":0,"n":1,"mean":0.31275,"min":0.31275,"max":0.31275,"variance_n":0}},
{"key":"40130192","value":{"M2":0,"n":1,"mean":0.31403,"min":0.31403,"max":0.31403,"variance_n":0}},
{"key":"40130193","value":{"M2":0,"n":1,"mean":0.30753,"min":0.30753,"max":0.30753,"variance_n":0}}
]}

If I don’t call it with group=true, I get an error:

curl 'http://localhost:5985/safetydb1213891/_design/Any/_view/bytime?group=false' -d '{"keys":["40130190","40130191","40130192","40130193","40130190","40130191","40130192","40130193"]}' 
{"error":"query_parse_error","reason":"Multi-key fetches for a reduce view must include group=true"}

So I guess that means if I stick with this approach, I will need to ditch the reduce entirely, and do processing in an external program.
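
Here is roughly what that external step could look like. This is only a sketch: it assumes a map-only version of the view (I am calling it "values" here; that name is made up), pulls every row for the detector into memory, and does the sampling with replacement and the statistics on the client side. It only bootstraps the mean; the variance side works the same way, and any HTTP client in any language would do just as well.

// sketch of the external bootstrap step; the "values" view name is hypothetical,
// and it is assumed to be a map-only view emitting doc._id -> a number
var view = "http://localhost:5985/safetydb1213891/_design/Any/_view/values";

async function bootstrapMean(B) {
    var rows = (await (await fetch(view)).json()).rows;
    var values = rows.map(function (r) { return r.value; });
    var n = values.length;

    // B bootstrap replicates: each one is n draws with replacement
    var replicateMeans = [];
    for (var b = 0; b < B; b++) {
        var sum = 0;
        for (var i = 0; i < n; i++) {
            sum += values[Math.floor(Math.random() * n)];
        }
        replicateMeans.push(sum / n);
    }

    // bootstrap estimate of the mean, and the variance of that estimate
    var mean = replicateMeans.reduce(function (a, x) { return a + x; }, 0) / B;
    var variance = replicateMeans.reduce(function (a, x) {
        return a + (x - mean) * (x - mean);
    }, 0) / (B - 1);
    return {mean: mean, variance: variance};
}

bootstrapMean(1000).then(function (est) { console.log(est); });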

I haven’t yet tried my daily average approach, where a single document contains an entire day. I don’t expect that m out of n sampling will work, at least not with a random number generator in there, as there is that requirement in the CouchDB docs that a view always produce the same output given the same input. But a balanced approach should work, as long as the permutation process is “pseudo-random” and repeatable for the day. (Pick any normal number and use that.)
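
Something like the following map might do for the repeatable part: derive a bucket from a hash of the document id, so the “random” assignment is deterministic and the view really does produce the same output for the same input. The bucket count and the hash here are just placeholder choices.

// a repeatable stand-in for randomness: hash the document id into one of N
// buckets, so the view output is the same every time it is rebuilt
function(doc) {
    if (doc.year) {
        var N = 20;  // number of subsamples; placeholder value
        var h = 0;
        for (var i = 0; i < doc._id.length; i++) {
            h = (h * 31 + doc._id.charCodeAt(i)) % 104729;  // 104729 is just a prime
        }
        emit(h % N, doc["any accident"] - 0);
    }
}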

2 thoughts on “Close still doesn’t count …”

  1. Remember that by default a reduce wants to return you a single aggregate value for the entire view. So without group=true, multi-get doesn’t really make sense as there’s only a single value.

    Also, if I gather right, you could try adding a random number in the range [0, 1] and then have a map function of emit(doc.random_field * M % N, doc.float_value_of_interest) (you pick M and N to give the approximate number of values per group) and then do your statistics on those collections. I have no idea if the sampling math works there or not, but I don’t know that it doesn’t. :)

  2. Hi, thanks for the comment.

    Actually, I want exactly one value! I don’t really want a multi-get, I want a multi-reduce. I want to take those m documents I specify, including the duplicates, and crunch them all through my reduce function, which computes the statistic of interest on the input. I was hoping to be able to use multi-get in this way, since for this application, mapping without reducing isn’t so helpful.

    In really simple terms, bootstrapping is sampling with replacement. So for any m randomly chosen values from a set of n, there is a finite chance that some of those choices will be duplicates. I gather that’s the whole point of the bootstrap method. And that is the part I’m having the most difficulty with in CouchDB’s map/reduce.

    Why is bootstrapping useful? Well, given my data, I can’t justify computing things like mean and variance in the usual way. The problem is that it is quite likely that I have lots of extreme outliers, and I’m also pretty sure my data is truncated at zero and has a really fat tail, but that’s about it. With bootstrap methods, you can get away with knowing nothing, but the price is that the method is often computer-intensive. So instead of computing the mean and variance on the entire set once, I have to sample from the data set hundreds or thousands of times, with replacement, and compute the mean and variance for each sample, then use those to compute the bootstrap estimate of the mean and variance.

    As to your second suggestion, that just might be a pretty clever idea for tackling the balanced bootstrap, where you just do B copies of the data, randomize the lot, then split it into B subsamples of size n (if I remember correctly). I’m pretty sure as soon as you emit something random, you violate that clause in the view definition that says the map must produce the same outputs given the same inputs. But if I can get around that, say by setting the seed of random each time, that might work pretty well. It also might work well in a temporary view, and frankly given the randomization and multiple sample requirements, it probably doesn’t help at all to have the view cached. But I also don’t want to break my database because some implementation detail relies on that same input same output rule. So I’ll have to test that carefully.
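
    To make that suggested map concrete, it might look something like the sketch below, assuming a random_field in [0, 1] gets written into each document when the data is loaded. My documents don’t have that field yet, and M and N are just placeholders, so this is purely hypothetical.

    // hypothetical map for the random_field idea from the first comment
    function(doc) {
        if (doc.year && doc.random_field !== undefined) {
            var M = 1000, N = 50;  // pick M and N for the group size you want
            emit(Math.floor(doc.random_field * M) % N, doc["any accident"] - 0);
        }
    }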
