my first non-trival reduce for couchdb

Update:  I posted up a cleaner version of this to the CouchDB wiki at http://wiki.apache.org/couchdb/View_Snippets

So.  I need to compute the standard deviation.  I didn’t trust jchris’ couchdb reduce example, so I decided to dig through google and find (again) the accepted on-line way to compute standard deviation (and other moments).

All in all  a pretty interesting search.  There is a great, free programming book available from MIT.  All about lists.  That didn’t help.  Then I found the above referenced example in the couchdb mailing lists, as well as another that pointed to a java library.  I looked at that, and noted that it cited Knuth’s art of computer programming.  So that was good.  Then I did another google search and eventually looked at the Wikipedia entry (which was typically pretty sloppy) but which atypically had a decent reference or two.  Both of those references were on-line too, so I eventually worked up two algorithms to compute the second moment.

I did two, because the output should match, no?  and it does.  So they must be correct!  (Good old CRA data management strategies coming into play there).

First the raw reduce code, cut from the Futon window:

function (keys, values, rereduce) {

    // algorithm for on-line computation of moments from
//
//    Tony F. Chan, Gene H. Golub, and Randall J. LeVeque: "Updating
//    Formulae and a Pairwise Algorithm for Computing Sample
//    Variances." Technical Report STAN-CS-79-773, Department of
//    Computer Science, Stanford University, November 1979.
    // url: ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
    // so there is some wierdness in that the original was Fortran, index from 1,
    // and lots of arrays (no lists, no hash tables)

    // also consulted http://people.xiph.org/~tterribe/notes/homs.html
    // and http://www.jstor.org/stable/2683386
    // and (ick!) the wikipedia description of Knuth's algorithm
    // to clarify what was going on with http://www.slamb.org/svn/repos/trunk/projects/common/src/java/org/slamb/common/stats/Sample.java

    function combine_S(current,existing,key){
	if(!key){key='risk';}
	var NS=current.S;
	var NSum=current.Sum;
	var M = existing.M;
	if(!M){M=0;}
	if(M>0){
	    var diff =
		((current.M * existing.Sum / existing.M) - current.Sum );

	    NS += existing.S + existing.M*diff*diff/(current.M * (current.M+existing.M) );
	    NSum += existing.Sum ;
	}
	return {'S':NS,'Sum':NSum, 'M': current.M+M };
    }

    function pairwise_update (values, M, Sum, S, key){
	if(!key){key='risk';}
	if(!Sum){Sum = 0; S = 0; M=0;}
	if(!S){Sum = 0; S = 0; M=0;}
	if(!M){Sum = 0; S = 0; M=0;}
	var T;
	var stack_ptr=1;
	var N = values.length;
	var half = Math.floor(N/2);
	var NSum;
	var NS ;
	var SumA=[];
	var SA=[];
	var Terms=[];
	Terms[0]=0;
	if(N == 1){
	    Nsum=values[0][key];
	    Ns=0;
	}else if(N > 1){
	    // loop over the data pairwise
	    for(var i = 0; i < half; i++){
		SumA[stack_ptr]=values[2*i+1][key] + values[2*i][key];
		var diff = values[2*i + 1][key] - values[2*i][key] ;
		SA[stack_ptr]=( diff * diff ) / 2;
		Terms[stack_ptr]=2;
		while( Terms[stack_ptr] == Terms[stack_ptr-1]){
		    // combine the top two elements in storage, as
		    // they have equal numbers of support terms.  this
		    // should happen for powers of two (2, 4, 8, etc).
		    // Everything else gets cleaned up below
		    stack_ptr--;
		    Terms[stack_ptr]*=2;
		    var diff = SumA[stack_ptr] - SumA[stack_ptr+1];
		    SA[stack_ptr]=  SA[stack_ptr] + SA[stack_ptr+1] +
			(diff * diff)/Terms[stack_ptr];
		    SumA[stack_ptr] += SumA[stack_ptr+1];
		} // repeat as needed
		stack_ptr++;
	    }
	    stack_ptr--;
	    // check if N is odd
	    if(N % 2 !=  0){
		// handle that dangling element
		stack_ptr++;
		Terms[stack_ptr]=1;
		SumA[stack_ptr]=values[N-1][key];
		SA[stack_ptr]=0;
	    }
	    T=Terms[stack_ptr];
	    NSum=SumA[stack_ptr];
	    NS= SA[stack_ptr];
	    if(stack_ptr > 1){
		// values.length is not power of two, handle remainders
		for(var i = stack_ptr-1; i>=1 ; i--){
		    var diff = Terms[i]*NSum/T-SumA[i];
		    NS = NS + SA[i] +
			( T * diff * diff )/
			(Terms[i] * (Terms[i] + T));
		    NSum += SumA[i];
		    T += Terms[i];
		}
	    }
	}
	// finally, combine NS and NSum with S and Sum
	return 	combine_S(
	    {'S':NS,'Sum':NSum, 'M': T },
	    {'S':S,'Sum':Sum, 'M': M });
    }

    var output={};
    if(!rereduce)
    {
	output = pairwise_update(values);
      var mean = values[0].risk;
      var min = values[0].risk;
      var max = values[0].risk;
      var M2 = 0;
	for(var i=1 ; i<values.length; i++){
	    var diff = (values[i].risk - mean);
            var newmean = mean +  diff / (i+1);
            M2 += diff * (values[i].risk - newmean);
            mean = newmean;
            min = Math.min(values[i].risk, min);
            max = Math.max(values[i].risk, max);
        }

	output.min=min;
	output.max=max;
	output.mean=mean;
	output.M2=M2;
	output.variance_n=M2/values.length;
	output.variance_nOtherWay=output.S/output.M;
	output.mean_OtherWay = output.Sum/output.M;
        output.n = values.length;

    } else {
	/*
           we have an existing pass, so should have multiple outputs to combine
        */
       var mean = 0;
       var min = Infinity;
       var max = -Infinity;
       var M2 = 0;
       var n = 0;
	for(var v in values){
	    output = combine_S(values[v],output);
            var newn = n + values[v].n;
            var newmean = ( n*mean + values[v].n*values[v].mean)/newn;
            min = Math.min(values[v].min, min);
            max = Math.max(values[v].max, max);
            var diff = values[v].mean - mean;
            newmean2 = mean + diff*(values[v].n/newn)
            M2 += values[v].M2 + (diff * diff * n * values[v].n / newn );
            n=newn;
            mean=newmean2;
	}
	output.min=min;
	output.max=max;
	output.mean=mean;
	output.M2=M2;
	output.variance_n=M2/n;
	output.variance_nOtherWay=output.S/output.M;
	output.mean_OtherWay = output.Sum/output.M;
	output.n = n;

    }
    // and done
    return output;
}

And there you have it.  The input as you might be able to tell from the code, is an object with “risk” as the pertinent bit of information getting reduce.  The above code is very difficult to read, but I wanted to get it up now in case I never get back to it.

My goal is to split the code into two, and time them both on really large sets of data, then use the best one and delete the other.  Another reason to get it up somewhere.

I like the non-Knuth version, even though it is pretty tortured in its use of stacks and so on, because the authors go on at length about how much better it is to use their pairwise algorithm, both numerically and from a storage point of view.  Also it lends itself to tacking on the other code for computing higher order moments.  But I think I will keep the min/max stuff from the java code, as it could be useful.

Oh, to close a loophole, the reason I didn’t trust the code from the couchdb-reduce-example is that I didn’t see immediately what I was looking for.  On closer inspection, it probably does the same thing as my code.  Hmm, yes, for completeness, I am going to plug in this third way and check if it give the same results.  And indeed it does:

{"S": 1276.8988123975391, "Sum": 1257.4497350063907, "M": 955, "min":
0.033031734767263086, "max": 6.011336961717487, "mean":
1.3167012932004087, "M2": 1276.898812397539, "variance_n":
1.3370668192644386, "variance_nOtherWay": 1.3370668192644388,
"mean_OtherWay": 1.3167012932004092, "n": 955, "stdDeviation":
1.1563160550923948, "count": 955, "total": 1257.4497350063905,
"sqrTotal": 2932.5845046149643}

As long as you believe that sqrt(1.3370668192644386) is about 1.1563160550923948.  Oh, and on rereading the rationale behind the  Chan, Golub, and  LeVeque paper they state:

This problem is sometimes avoided by use of the following textbook algorithm, so called because, unfortunately, it is often suggested in statistical textbooks:

S = Sum(x**2)  – (1/N) ( Sum(x) ) **2

[[ the equation used in the github example ]]

This rearrangement allows S to be computed with only one pass through the data, but the computation may be numerically unstable and should almost never be used in practice. This instability is particularly troublesome when S is very small.

For the record, I did not notice any problems so far on my test data, but I certainly have run into numerical stability problems in other cases on the more general data set, so I will probably stick with the other algorithms.

About these ads

3 thoughts on “my first non-trival reduce for couchdb

  1. Pingback: links for 2009-09-30 « Concurrent

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s