<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Contour Line &#187; research</title>
	<atom:link href="http://contourline.wordpress.com/category/research/feed/" rel="self" type="application/rss+xml" />
	<link>http://contourline.wordpress.com</link>
	<description>Surround and define the edges of a subject, giving it shape and volume</description>
	<lastBuildDate>Fri, 13 Nov 2009 17:45:35 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='contourline.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/46bd6fbf3e12066a454c58d20b938584?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Contour Line &#187; research</title>
		<link>http://contourline.wordpress.com</link>
	</image>
			<item>
		<title>RJSONIO to process CouchDB output</title>
		<link>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/</link>
		<comments>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 22:23:03 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=223</guid>
		<description><![CDATA[I have an idea.  I am going to process the 5-minute aggregates of raw detector data I&#8217;ve stored in monthly CouchDB databases using R via RCurl and RJSONIO.  So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, and then use RJSONIO to [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I have an idea.  I am going to process the 5-minute aggregates of raw detector data I&#8217;ve stored in monthly CouchDB databases using R via RCurl and RJSONIO.  So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, then use RJSONIO to parse the JSON, then use bootstrap methods to estimate the expected value and confidence bounds, and perhaps more importantly, try to identify outliers and unusual events.   <span id="more-223"></span>   </p>
<p>Update: this works great, except it reveals that my JSON structure in CouchDB isn&#8217;t so great.  The problem is that I&#8217;m dumping one JSON object per line.  For example:</p>
<pre><strong><code> ["1201044", "00:00:00", "Fri", "12"]:{N:8,O:0.001782, Pct:1, lanes: 5, intrvls: 10}</code></strong></pre>
<p>While that looks great on paper, and logically makes sense if you think about pulling a single record, it doesn&#8217;t work so well when you process lots of records.  While RJSONIO is pretty darn good, it certainly isn&#8217;t a mind reader, and it cannot turn a list of such objects into a matrix or data frame without some help.  If you just throw the results of the RCurl fetch at RJSONIO, you get the following:</p>
<pre><code>&gt; demo=fromJSON(data)
&gt; demo$rows[1]
[[1]]
[[1]]$key
[1] "1202024"  "17:35:00" "Fri"      "12"

[[1]]$value
[[1]]$value$N
[1] 427

[[1]]$value$O
[1] 0.04861833

[[1]]$value$Pct
[1] 1

[[1]]$value$lanes
[1] 6

[[1]]$value$intrvls
[1] 10
</code></pre>
<p>In words, what that means is that the CouchDB response of <code>{rows:[...]}</code> is parsed as a labeled list by R, so the response is a list with one element, <code>rows</code>, which contains <code>n</code> elements, each with an element <code>key</code>, which is a character vector holding the four key terms, and another element <code>value</code>, which itself is a list containing several named elements <code>N, O, Pct, lanes, intrvls</code>.  I couldn&#8217;t figure out a quick way to make R figure out that I wanted a <code>data.frame</code> with named entries for each of the key terms and each of the value terms (4 key terms plus 5 value terms, so 9 columns by n rows).  Many more gray hairs later, I remembered about <code>unlist</code> and got stuff sorted.  Here is my suboptimal R script for the next time I take a long break from using R and can&#8217;t remember the syntax anymore.</p>
<pre><code>
library(RCurl)    ## provides getURL
library(RJSONIO)  ## provides fromJSON
#parameters: month,id,fivemin
id=1202024  ## randomly chosen
fivemin="17:35"
# get every month in parallel.  RCurl is cool that way
month=c("01","02","03","04","05","06","07","08","09","10","11","12")
couchdb = "http://localhost:5984/"
db = paste("d12_2007_",month,"morehash/_design/summary/_view/fivemin?",sep="")
moreurl = paste("group=true&amp;startkey=[\"",id,"\",\"",fivemin,":00\"]&amp;endkey=[\"",id,"\",\"",fivemin,":01\"]",sep="")
uri=paste(couchdb,db,moreurl,sep="");  ## 12 different URIs to fetch
data = getURL(uri)
## make a list to store data temporarily on the first pass
d1=list()
for(i in seq_along(data)){
  ## parse each month in turn
  jsondata = fromJSON(data[[i]])
  ## unlist flattens the R object
  d1[[i]]=unlist(jsondata$rows)
}
## make the list of flattened R objects into a matrix
## by unlisting again, and specifying that I'm expecting 9 columns
dmatrix = matrix(data=unlist(d1),ncol=9,byrow=TRUE)
## finally, make a dataframe explicitly labeling each column as needed and converting to numeric from text
d2= data.frame(id=dmatrix[,1],
                      tod=dmatrix[,2],
                      dow=dmatrix[,3],
                      dom=as.numeric(dmatrix[,4]),
                      N=as.numeric(dmatrix[,5]),
                      O=as.numeric(dmatrix[,6]),
                      pct=as.numeric(dmatrix[,7]),
                      lanes=as.numeric(dmatrix[,8]),
                      intervals=as.numeric(dmatrix[,9]))
</code>
</pre>
<p>Next up is the actual bootstrapping of interesting statistics.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Close still doesn&#8217;t count &#8230;</title>
		<link>http://contourline.wordpress.com/2009/04/10/close-still-doesnt-count/</link>
		<comments>http://contourline.wordpress.com/2009/04/10/close-still-doesnt-count/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 23:13:13 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=166</guid>
		<description><![CDATA[&#8230; except for nukes and bocci.
I can *almost* make bootstrapping work, but not entirely within couchdb.  I am going to have to do external processing.  Which is probably fine.  
Anyway, here&#8217;s where I am so far.  I am loading up one database per detector, with documents that look like:
{
   "_id": "40130160",
   [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>&#8230; except for nukes and bocci.</p>
<p>I can *almost* make bootstrapping work, but not entirely within CouchDB.  I am going to have to do external processing, which is probably fine.  <span id="more-166"></span></p>
<p>Anyway, here&#8217;s where I am so far.  I am loading up one database per detector, with documents that look like:</p>
<pre>{
   "<code class="key">_id</code>": <code class="string">"40130160"</code>,
   "<code class="key">_rev</code>": <code class="string">"1-1446962830"</code>,
   "<code class="key">CV_OCC_1</code>": <code class="string">"0.29665"</code>,
   "<code class="key">LAG1_OCC_1</code>": <code class="string">"-0.02416"</code>,
   "<code class="key">CORR_OCC_1M X LAG1_OCC_R</code>": <code class="string">"0.07549"</code>,
   "<code class="key">Severity--PDO</code>": <code class="string">"0.46376"</code>,
   "<code class="key">LAG1_OCC_R</code>": <code class="string">"0.15293"</code>,
   "<code class="key">left lane accident</code>": <code class="string">"0.055"</code>,
   "<code class="key">2 veh accident</code>": <code class="string">"0.22823"</code>,
   "<code class="key">vdsid</code>": <code class="string">"1203692"</code>,
   "<code class="key">SD_VOL_M</code>": <code class="string">"3.16066"</code>,
   "<code class="key">EstimateTime</code>": <code class="string">"2008-02-24T18:00:00-0800"</code>,
   "<code class="key">CORR_VOL_1R</code>": <code class="string">"0.45782"</code>,
   "<code class="key">month</code>": <code class="string">"02"</code>,
   "<code class="key">1 veh accident</code>": <code class="string">"0.0597"</code>,
   "<code class="key">day</code>": <code class="string">"Sun"</code>,
   "<code class="key">CV_OCC_R</code>": <code class="string">"0.47489"</code>,
   "<code class="key">CORR_VOL_1M</code>": <code class="string">"0.4268"</code>,
   "<code class="key">MU_VOL_M</code>": <code class="string">"13.60"</code>,
   "<code class="key">CV_OCC_M</code>": <code class="string">"0.26119"</code>,
   "<code class="key">LAG1_VOL_M</code>": <code class="string">"-0.07773"</code>,
   "<code class="key">CORR_OCC_1R</code>": <code class="string">"0.21826"</code>,
   "<code class="key">IntervalSeconds</code>": <code class="string">"1200"</code>,
   "<code class="key">MU_VOL_R</code>": <code class="string">"11.625"</code>,
   "<code class="key">CORR_VOLOCC_1M</code>": <code class="string">"0.22346"</code>,
   "<code class="key">VOL_M</code>": <code class="string">"15.00"</code>,
   "<code class="key">LAG1_OCC_M</code>": <code class="string">"-0.00962"</code>,
   "<code class="key">VOL_1</code>": <code class="string">"20.00"</code>,
   "<code class="key">OCC_1</code>": <code class="string">"0.12778"</code>,
   "<code class="key">CV_VOLOCC_1 X CORR_VOLOCC_1M</code>": <code class="string">"0.01314"</code>,
   "<code class="key">Severity--Injury</code>": <code class="string">"0.10563"</code>,
   "<code class="key">OCC_M</code>": <code class="string">"0.11444"</code>,
   "<code class="key">fiveminute</code>": <code class="string">"18:00:00"</code>,
   "<code class="key">MU_VOL_1</code>": <code class="string">"13.20"</code>,
   "<code class="key">CORR_OCC_1M</code>": <code class="string">"0.4936"</code>,
   "<code class="key">CV_VOLOCC_1</code>": <code class="string">"0.05882"</code>,
   "<code class="key">SumVol</code>": <code class="string">"1,537.00"</code>,
   "<code class="key">CORR_VOLOCC_1R</code>": <code class="string">"0.28207"</code>,
   "<code class="key">SD_VOL_R</code>": <code class="string">"3.05243"</code>,
   "<code class="key">OCC_R</code>": <code class="string">"0.14"</code>,
   "<code class="key">CV_VOLOCC_R</code>": <code class="string">"0.21446"</code>,
   "<code class="key">off road accident</code>": <code class="string">"0.0949"</code>,
   "<code class="key">CORR_OCC_1M X MU_VOL_M</code>": <code class="string">"6.71293"</code>,
   "<code class="key">MuVolocc</code>": <code class="string">"352.40435"</code>,
   "<code class="key">LAG1_VOL_R</code>": <code class="string">"-0.14296"</code>,
   "<code class="key">VOL_R</code>": <code class="string">"11.00"</code>,
   "<code class="key">interior lanes accident</code>": <code class="string">"0.14903"</code>,
   "<code class="key">CORR_VOL_MR</code>": <code class="string">"0.41195"</code>,
   "<code class="key">CORR_VOLOCC_MR</code>": <code class="string">"0.21033"</code>,
   "<code class="key">LAG1_VOL_1</code>": <code class="string">"-0.01163"</code>,
   "<code class="key">CORR_OCC_MR</code>": <code class="string">"0.31359"</code>,
   "<code class="key">3+ veh accident</code>": <code class="string">"0.09185"</code>,
   "<code class="key">CORR_OCC_1M X SD_VOL_R</code>": <code class="string">"1.50667"</code>,
   "<code class="key">year</code>": <code class="string">"2008"</code>,
   "<code class="key">CV_VOLOCC_M</code>": <code class="string">"0.06651"</code>,
   "<code class="key">right lane accident</code>": <code class="string">"0.08891"</code>,
   "<code class="key">any accident</code>": <code class="string">"0.38626"</code>,
   "<code class="key">SD_VOL_1</code>": <code class="string">"3.59629"</code>
}</pre>
<p>Then I have a view with the following map:</p>
<pre>function(doc) {
    // skip any document that has no year field
    if(doc.year){
        var name="any accident";
        // subtracting 0 coerces the stored string to a number
        emit(doc._id, doc[name] - 0);
    }
}</pre>
<p>and a reduce that is more or less the same as the Knuthian mean and variance that I wrote up in an earlier post.  My idea was to do bootstrap sampling by just using the POST {&#8220;keys&#8221;: ["key1", "key2", ...]} call documented on the <a href="http://wiki.apache.org/couchdb/HTTP_view_API" target="_blank">HTTP view API page</a>.  But it doesn&#8217;t work, or rather, it works, but the API requires group=true, so each key is reduced on its own rather than the whole sample being reduced together.  So what I get out is something like:</p>
<pre>curl 'http://localhost:5985/safetydb1213891/_design/Any/_view/bytime?group=true' -d '{"keys":["40130190","40130191","40130192","40130193","40130190","40130191","40130192","40130193"]}'
{"rows":[
{"key":"40130190","value":{"M2":0,"n":1,"mean":0.32849,"min":0.32849,"max":0.32849,"variance_n":0}},
{"key":"40130191","value":{"M2":0,"n":1,"mean":0.31275,"min":0.31275,"max":0.31275,"variance_n":0}},
{"key":"40130192","value":{"M2":0,"n":1,"mean":0.31403,"min":0.31403,"max":0.31403,"variance_n":0}},
{"key":"40130193","value":{"M2":0,"n":1,"mean":0.30753,"min":0.30753,"max":0.30753,"variance_n":0}},
{"key":"40130190","value":{"M2":0,"n":1,"mean":0.32849,"min":0.32849,"max":0.32849,"variance_n":0}},
{"key":"40130191","value":{"M2":0,"n":1,"mean":0.31275,"min":0.31275,"max":0.31275,"variance_n":0}},
{"key":"40130192","value":{"M2":0,"n":1,"mean":0.31403,"min":0.31403,"max":0.31403,"variance_n":0}},
{"key":"40130193","value":{"M2":0,"n":1,"mean":0.30753,"min":0.30753,"max":0.30753,"variance_n":0}}
]}</pre>
<p>If I don&#8217;t call it with group=true, I get an error:</p>
<pre>curl 'http://localhost:5985/safetydb1213891/_design/Any/_view/bytime?group=false' -d '{"keys":["40130190","40130191","40130192","40130193","40130190","40130191","40130192","40130193"]}'
{"error":"query_parse_error","reason":"Multi-key fetches for a reduce view must include group=true"}</pre>
<p>So I guess that means if I stick with this approach, I will need to ditch the reduce entirely, and do processing in an external program.</p>
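<p>For reference, a Knuth-style running mean and variance can be written as a reduce along these lines.  This is a minimal sketch rather than the reduce used above: the function names <code>summarize</code> and <code>merge</code> are made up for illustration, and only the output field names (<code>M2</code>, <code>n</code>, <code>mean</code>, <code>min</code>, <code>max</code>, <code>variance_n</code>) are taken from the rows shown earlier.  The same two functions would also work in an external program once the reduce is dropped.</p>
<pre>
```javascript
// Hypothetical sketch; not the reduce actually deployed in the view above.
// One pass over the values, updating the mean and the sum of squared deviations.
function summarize(values) {
  var acc = { M2: 0, n: 0, mean: 0, min: Infinity, max: -Infinity, variance_n: 0 };
  values.forEach(function (x) {
    acc.n += 1;
    var delta = x - acc.mean;
    acc.mean += delta / acc.n;
    acc.M2 += delta * (x - acc.mean);
    acc.min = Math.min(acc.min, x);
    acc.max = Math.max(acc.max, x);
  });
  acc.variance_n = acc.n === 0 ? 0 : acc.M2 / acc.n;
  return acc;
}

// Combine two partial summaries, which is what a rereduce step has to do.
function merge(a, b) {
  if (a.n === 0) return b;
  if (b.n === 0) return a;
  var n = a.n + b.n;
  var delta = b.mean - a.mean;
  var M2 = a.M2 + b.M2 + delta * delta * a.n * b.n / n;
  return {
    M2: M2,
    n: n,
    mean: a.mean + delta * b.n / n,
    min: Math.min(a.min, b.min),
    max: Math.max(a.max, b.max),
    variance_n: M2 / n
  };
}

// CouchDB calls a reduce with (keys, values, rereduce).
function reduce(keys, values, rereduce) {
  if (rereduce) {
    return values.reduce(merge, summarize([]));
  }
  return summarize(values);
}
```
</pre>
<p>For a single value this returns an <code>n</code> of 1 and an <code>M2</code> of 0, which is consistent with the per-key rows above.</p>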
<p>I haven&#8217;t yet tried my daily average approach, where a single document contains an entire day.  I don&#8217;t expect that m out of n sampling will work, at least not with a random number generator in there, as the CouchDB docs require that a view always produce the same output given the same input.  But a balanced approach should work, as long as the permutation process is &#8220;pseudo-random&#8221; and repeatable for the day.  (Pick any normal number and use that).</p>
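<p>One way to get a permutation that is &#8220;pseudo-random&#8221; yet repeatable for a day is to seed a small deterministic generator from the date.  The sketch below is only an illustration: the linear congruential generator, the <code>seedFromDay</code> hash, and all of the function names are assumptions, not anything CouchDB provides.</p>
<pre>
```javascript
// Assumption: a linear congruential generator (Numerical Recipes constants)
// stands in for the repeatable source; any deterministic generator would do.
function makeRng(seed) {
  var state = seed % 4294967296;
  return function () {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // a number in [0, 1)
  };
}

// Hypothetical helper: turn a date string such as "2008-02-24" into a seed.
function seedFromDay(day) {
  var seed = 0;
  day.split("").forEach(function (ch) {
    seed = (seed * 31 + ch.charCodeAt(0)) % 4294967296;
  });
  return seed;
}

// Fisher-Yates shuffle driven by the seeded generator; the input is not modified.
function permute(observations, rng) {
  var out = observations.slice();
  for (var i = out.length - 1; i > 0; i--) {
    var j = Math.floor(rng() * (i + 1));
    var tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}
```
</pre>
<p>Calling <code>permute(obs, makeRng(seedFromDay(day)))</code> twice with the same day gives the same ordering, which is the property the view engine needs.</p>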
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/04/10/close-still-doesnt-count/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>More thoughts on using bootstrap</title>
		<link>http://contourline.wordpress.com/2009/04/10/more-thoughts-on-using-bootstrap/</link>
		<comments>http://contourline.wordpress.com/2009/04/10/more-thoughts-on-using-bootstrap/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 17:19:09 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=160</guid>
		<description><![CDATA[Closer, but still not there yet with bootstrap sampling in CouchDB.   My prior post was mostly thinking out loud.  I&#8217;ve tried some things since, and this post is an attempt to organize my thoughts on the topic.
The first thing I tried was to submit a list of document ids to a view, and see what [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Closer, but still not there yet with bootstrap sampling in CouchDB.   My prior post was mostly thinking out loud.  I&#8217;ve tried some things since, and this post is an attempt to organize my thoughts on the topic.</p>
<p><span id="more-160"></span>The first thing I tried was to submit a list of document ids to a view, and see what happened.  This might work, and it might not.  It certainly won&#8217;t work as I expected.  That is, I have to do a very flat view: the map has to emit doc._id, value, and the reduce has to compute the statistic of interest over all of the input values.  I haven&#8217;t tried this yet, but my guess is that this will simply requery the reduce view with all of the input values.  So no time is saved by CouchDB&#8217;s caching of views.</p>
<p>Another approach is to put a random sequence in the view and sample from that.  The problem there is that I need to recompute the view every time.  Using external programs, I will have to query the db for the list of docids, sample those with replacement to build my bootstrap sample, then create a view and submit it to the one-off view processor.  Given that the view can&#8217;t be cached anyway, the performance hit for this approach will always be paid, so no big deal not having a cached view.  Still, it would be nice to not have to rewrite the view every time I want to use it just because the database has grown.</p>
<p>Another approach that I am thinking about now is to save my data differently.  Instead of saving the output of each observation&#8217;s computations as its own document, I would collect those results into an entire day&#8217;s worth of data, and stuff the db with that.  Unfortunately, I&#8217;ll also have to rewrite my Java code, as at the moment I am grabbing a few hours of time across all detectors.  Instead I&#8217;ll need to grab a day&#8217;s data across a single detector.</p>
<p>With this method, I *think* I can implement a bootstrap sampling with replacement inside of the JavaScript.  The rules of the view engine are preserved: each document is independent of every other, and processing a single document does not rely on any other document in the database.  The map part will sample with replacement from the observations for that day, and then emit as many replicates as I want for the day.</p>
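<p>A rough sketch of what such a map could look like, written as a plain function so it can be run outside CouchDB.  The <code>observations</code> field, the <code>replicates</code> and <code>rng</code> arguments, and the explicit <code>emit</code> callback are all hypothetical; a real view would also need <code>rng</code> to be deterministic so that the same document always produces the same output.</p>
<pre>
```javascript
// Hypothetical document shape: { day: "2008-02-24", observations: [numbers] }.
// For each replicate, draw as many observations as the day holds, with
// replacement, and emit the mean of the draw keyed by day and replicate index.
function mapDay(doc, replicates, rng, emit) {
  var obs = doc.observations;
  if (!obs || obs.length === 0) return;
  for (var left = replicates; left > 0; left--) {
    var r = replicates - left; // replicate index, counting up from 0
    var sum = 0;
    for (var i = obs.length; i > 0; i--) {
      // one draw with replacement
      sum += obs[Math.floor(rng() * obs.length)];
    }
    emit([doc.day, r], sum / obs.length);
  }
}
```
</pre>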
<p>Which brings up another topic I am still unsure about.  Most of the bootstrap references talk about sampling n times from an original sample of size n.  That is, if it is 1,000 observations, each sample has 1,000 observations.  There is some discussion in Chernick&#8217;s book around p. 178 about using m out of n sampling, that is, drawing a smaller sample than n.  The rule is pretty vague, something about m growing with n, but at a smaller rate than n.  So as n goes to infinity, so does m, but m/n goes to zero.  That is really broad, and I need to get a better source and/or try it out for myself.  Anyway, it seems like log(n) would fit this rule, but would be a terribly small sample.</p>
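<p>To make the rule concrete, here are a few candidate growth rates for m.  The choices of log(n), the square root of n, and n to the 2/3 power are illustrative assumptions, not taken from Chernick; each grows without bound while m/n shrinks toward zero.</p>
<pre>
```javascript
// Candidate subsample sizes m for an m-out-of-n bootstrap.
// These three rules are assumptions for illustration only.
function candidateSizes(n) {
  return {
    log: Math.round(Math.log(n)),
    sqrt: Math.round(Math.sqrt(n)),
    twoThirds: Math.round(Math.pow(n, 2 / 3))
  };
}

// Print m and m/n for a couple of data set sizes.
[1000, 1000000].forEach(function (n) {
  var m = candidateSizes(n);
  console.log(n, m.log, m.sqrt, m.twoThirds, m.twoThirds / n);
});
```
</pre>
<p>For n = 1,000 the log rule gives a sample of about 7, against roughly 32 and 100 for the other two.</p>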
<p>The point of using the m of n sample is to reduce the impact of outliers or a fat tail.  I do have outliers in my data, so it makes sense to use it.  I guess the best solution is to test it versus, say, the balanced sampling approach (b random permutations of the n observations sampled b times), and then inspect the differences in the resulting bias and variance for both estimates.</p>
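<p>As a sketch of the balanced approach, assuming the usual construction: pool b copies of the n observations, shuffle the pool once, and cut it into b samples of size n, so that every observation is used exactly b times overall.  The function name and the optional <code>rng</code> argument are made up for illustration.</p>
<pre>
```javascript
// Balanced bootstrap sketch. rng is any function returning a number in [0, 1);
// Math.random is used when none is given.
function balancedSamples(observations, b, rng) {
  rng = rng || Math.random;
  var n = observations.length;
  // pool b copies of the observations
  var pool = [];
  for (var copy = b; copy > 0; copy--) {
    pool = pool.concat(observations);
  }
  // Fisher-Yates shuffle of the pooled copies
  for (var i = pool.length - 1; i > 0; i--) {
    var j = Math.floor(rng() * (i + 1));
    var tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  // cut the shuffled pool into b samples of size n
  var samples = [];
  for (var s = b; s > 0; s--) {
    samples.push(pool.splice(0, n));
  }
  return samples;
}
```
</pre>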
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/04/10/more-thoughts-on-using-bootstrap/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Bootstrap in a view</title>
		<link>http://contourline.wordpress.com/2009/04/03/bootstrap-in-a-view/</link>
		<comments>http://contourline.wordpress.com/2009/04/03/bootstrap-in-a-view/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 19:35:20 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=156</guid>
		<description><![CDATA[Inspired by this post, I am playing around with implementing bootstrapping various statistics as a view in CouchDB.  I am not a statistician, so my definition should not be used as gospel, but bootstrapping is a statistical method where one randomly samples from an observed set of data in order to determine some statistics, such [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Inspired by <a title="MAD skills" href="http://databeta.wordpress.com/2009/03/20/mad-skills/" target="_self">this post</a>, I am playing around with implementing bootstrapping various statistics as a view in CouchDB.  I am not a statistician, so my definition should not be used as gospel, but bootstrapping is a statistical method where one randomly samples from an observed set of data in order to determine some statistics, such as the mean or the median.  Most of the older sources I&#8217;ve read talk about using it for small to medium sized data sets, etc., and so the k samples are all of size n.  But I can&#8217;t do that because my input data is too big.  So I have to pick a sample size smaller than n.  I&#8217;m going with 1,000 for starters, and will repeat the draw 10,000 times.</p>
<p>(There&#8217;s probably a secondary bootstrap I can do there to decide on the optimal size of the bootstrap sample, but I&#8217;m not going to dive into that yet.)<span id="more-156"></span></p>
<p>So how to draw 1,000 random samples/documents from a CouchDB database and repeat that process 10,000 times efficiently?    My first inclination was to use the map part of the view, but I suspect that this will fail.  My reading of the docs is that the map must produce a consistent output given the same input. So if the map includes a hard-coded list of samples, great.  But I&#8217;m not so sure the map function is allowed to produce this random list on invocation, because then the map would not be consistent between runs, which would violate all kinds of assumptions I&#8217;m sure.</p>
<p>So the next approach would be to sample all the data, but pull a random subsample with replacement when querying the view.  So in this case, all of the data is used to generate the statistic in question (mean, median, whatever), plus confidence bounds, and then the query does the sampling with replacement.</p>
<p>To do this, my query program (it&#8217;d have to be programmatic) would first ping the db and get the docids, then randomly draw from those with replacement k times.  Each combination of ids would then get sent to the db using the POST {&#8220;keys&#8221;: ["key1", "key2", ...]} API semantics.</p>
<p>But that might not work for two reasons.  First, I&#8217;m not so sure that the POST will allow duplicate keys, although I guess this is easy enough to test.  Second, this still requires the complete computation of the map/reduce view function over the entire data set.  I already know this takes a long time and uses a lot of space, so I don&#8217;t really want to do that if I don&#8217;t have to.  Still, this approach seems pretty good.</p>
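<p>A minimal sketch of the sampling half of such a query program, assuming the docids have already been fetched and that the view is keyed by docid.  It only builds the JSON bodies for the POST; <code>bootstrapBodies</code> and its arguments are hypothetical names, and the HTTP calls themselves are left out.</p>
<pre>
```javascript
// Draw k bootstrap samples of docids, with replacement, and build one
// {"keys": [...]} request body per sample.
// rng is any function returning a number in [0, 1); Math.random by default.
function bootstrapBodies(docids, k, sampleSize, rng) {
  rng = rng || Math.random;
  var bodies = [];
  for (var s = k; s > 0; s--) {
    var keys = [];
    for (var d = sampleSize; d > 0; d--) {
      // duplicates are expected: this is sampling with replacement
      keys.push(docids[Math.floor(rng() * docids.length)]);
    }
    bodies.push(JSON.stringify({ keys: keys }));
  }
  return bodies;
}
```
</pre>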
<p>The last option I&#8217;m thinking about is to do the sampling in the reduce function, not in the map.  There was a response to one of my questions on the mailing list that if a view is byte-identical to another view, then it is only computed once.  So with that in mind, I could write multiple reduce functions paired to identical map functions that subsampled the available docs in different ways.  But I think this will just turn out to be slower in the end than putting the hardcoded sampling scheme into the view function. The reason I say this is because the map is compiled with every application, whereas the view is just compiled once.  So parsing and loading a long hard-coded list of samples to keep (with possible multiple samples) is going to take more time because it is done multiple times.  And I&#8217;d need to either keep 10,000 different reduce functions, all with a unique sampling list, or else generate one monster reduce function that simultaneously computes output on 10,000 possible samples.  I see the input set for the reduce job getting sent down each pipe in the matrix, and either being kept, processed, or processed multiple times on its way out the other end, with the final reduce step producing the final 10,000 statistic(s) and confidence bounds.   I&#8217;m thinking of a bit-mask index matrix like in R, but maybe it would be more efficient to make it a hash map, with an integer value representing the number of times to include the doc (because resampling means there can be multiples and all that).</p>
<p>The final option that I am *not* going to consider is to use a temporary view.  From the CouchDB wiki: &#8220;<strong>Temporary views</strong> are not stored in the database, but rather executed on demand.&#8221;   So at first this sounds pretty good, but I think it is misleading.  I&#8217;d expect that in a proper bootstrap sampling, most of the observations will get sampled at least once.  In that kind of a situation, I think the second option (building a simple statistic map/reduce view and then sampling it on query) is best, because the computations are cached.  Assuming of course that the map part is nontrivial, and the reduce part is relatively fast.</p>
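<p>The hash map idea can be sketched as follows: count how often each doc was drawn, then weight each value by its count.  The function names and the <code>valuesById</code> lookup are assumptions for illustration; this is plain JavaScript rather than an actual CouchDB reduce.</p>
<pre>
```javascript
// Count how many times each docid was drawn in one bootstrap sample.
function multiplicities(sampledIds) {
  var counts = {};
  sampledIds.forEach(function (id) {
    counts[id] = (counts[id] || 0) + 1;
  });
  return counts;
}

// Mean of the resample: a doc drawn c times counts c times,
// and a doc that is absent from the map is skipped entirely.
function resampledMean(valuesById, counts) {
  var total = 0;
  var n = 0;
  Object.keys(counts).forEach(function (id) {
    total += counts[id] * valuesById[id];
    n += counts[id];
  });
  return n === 0 ? NaN : total / n;
}
```
</pre>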
<p>Anyway, plenty to work on this weekend&#8230;as always, real tests of these ideas will reveal the truth.</p>
<p>Update:  I just tried some things out, and I am getting clued in a little bit more.  The keys for the POST are keys emitted by the map/reduce output, *not* the ids of the input documents.  My current map/reduce view stratifies by time of day, risk prediction as the key.  If I instead want random draws over all observations, my output is going to have to be keyed by the document ids.  That is pretty unworkable, I think, unless I have separate views for each time period of interest, each risk prediction, and emit keys that are the doc id.  Again, I&#8217;ll have to try it and see.  But option 1 is moving up as the most likely anyway, especially as I just read today about balanced resampling to reduce variance in bootstrap methods in Chernick&#8217;s 2008 book, Bootstrap Methods.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/04/03/bootstrap-in-a-view/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Time and space</title>
		<link>http://contourline.wordpress.com/2009/03/10/time-and-space/</link>
		<comments>http://contourline.wordpress.com/2009/03/10/time-and-space/#comments</comments>
		<pubDate>Tue, 10 Mar 2009 16:47:15 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=152</guid>
		<description><![CDATA[It takes a finite amount of time to process loop data into my database, and the results take up a finite amount of space.  So no matter what, if I process and save results, it will take time and space.  We&#8217;ve ordered a faster, bigger machine, and that will help speed things up and make [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>It takes a finite amount of time to process loop data into my database, and the results take up a finite amount of space.  So no matter what, if I process and save results, it will take time and space.  We&#8217;ve ordered a faster, bigger machine, and that will help speed things up and make space less of an issue, but there are more loop detectors to process.</p>
<p>So the presumption is that it is actually *worth* the time and space to compute and store the data.  This isn&#8217;t necessarily the case.  In fact, what I really want access to are the long-term averages of the accident risk values over time.  Going forward, I always want to keep around a little bit of data, but the primary use case is to compare historical averages (sliced and diced in various ways) to the current values.</p>
<p>The problem is that it is difficult to maintain historical trends without keeping the data handy.  As I&#8217;ve said in prior postings and in my notes, I really like how CouchDB&#8217;s map/reduce approach allows the generation of different layers of statistics.  By emitting an array as the key, and a predicted risk quantity as the value, the reduce function that computes mean and variance will be run for a cascading tree of the keys.   So just by writing a map with a key like [loop_id, month, day, 15_minute_period], I can ask for averages over all data, over just a single loop, over a loop for a month, over a loop for a month for a particular Monday, and so on.</p>
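<p>A minimal sketch of the idea, runnable outside CouchDB.  The document fields and the meanByLevel helper are assumptions for illustration.  In a real view the map function calls CouchDB&#8217;s built-in emit, and the roll-up is requested with the group_level query parameter rather than computed by hand.</p>

```javascript
// Hypothetical document shape: {loop_id, month, day, period, risk}.
// In CouchDB the map function would call the built-in emit(); here emit is
// passed in so the sketch runs outside the view server.
function mapDoc(doc, emit) {
    emit([doc.loop_id, doc.month, doc.day, doc.period], doc.risk);
}

// Mimic group_level: truncate each key to its first `level` elements and
// average the values that share the truncated key.
function meanByLevel(docs, level) {
    var groups = {};
    docs.forEach(function (doc) {
        mapDoc(doc, function (key, value) {
            var k = JSON.stringify(key.slice(0, level));
            if (!groups[k]) { groups[k] = { sum: 0, n: 0 }; }
            groups[k].sum += value;
            groups[k].n += 1;
        });
    });
    var out = {};
    Object.keys(groups).forEach(function (k) {
        out[k] = groups[k].sum / groups[k].n;
    });
    return out;
}

var docs = [
    { loop_id: 'a', month: 3, day: 1, period: 0, risk: 0.2 },
    { loop_id: 'a', month: 3, day: 1, period: 1, risk: 0.4 },
    { loop_id: 'a', month: 4, day: 2, period: 0, risk: 0.6 },
    { loop_id: 'b', month: 3, day: 1, period: 0, risk: 1.0 }
];
var overall = meanByLevel(docs, 0);      // one group: all data
var perLoop = meanByLevel(docs, 1);      // one group per loop
var perLoopMonth = meanByLevel(docs, 2); // one group per loop and month
```

<p>Only prefixes of the key can be grouped this way, so the order of the fields in the key decides which roll-ups come for free.</p>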
<p>On the other hand, this is limiting.  If I change my mind and want to aggregate over days but without splitting out months, or if I want to put a year field in there to evaluate annual variations, I can&#8217;t.  I have to rewrite the map, perhaps using the same view, and the whole shebang has to be recomputed, which is not trivial when the input set is about 15 GB per week.</p>
<p>As CouchDB matures, perhaps it will do a faster job computing views.  The approach is certainly there to parallelize the computations, but at the moment I only see a single process thrashing through the calculations.</p>
<p>Finally, if I delete old data, it isn&#8217;t clear to me how I would still maintain the running computations of mean and variance.  Technically it is possible: all you have to do is combine partial computations, knowing the number of observations that fed into each one.  But practically, I have a feeling that when I delete input data, the output will get blown away.</p>
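<p>To make the &#8220;combine partial computations&#8221; step concrete, here is a minimal sketch.  It assumes each partial result carries a count n, a mean, and M2 (the sum of squared deviations from that mean); the function and field names are just for illustration.</p>

```javascript
// Summarize one batch as count n, mean, and M2 (sum of squared deviations
// from the batch mean). Assumes a non-empty batch.
function summarize(xs) {
    var n = xs.length;
    var mean = xs.reduce(function (a, b) { return a + b; }, 0) / n;
    var M2 = xs.reduce(function (a, x) { return a + (x - mean) * (x - mean); }, 0);
    return { n: n, mean: mean, M2: M2 };
}

// Merge two summaries without revisiting the raw observations.
function combine(a, b) {
    var n = a.n + b.n;
    if (n === 0) { return { n: 0, mean: 0, M2: 0 }; }
    var delta = b.mean - a.mean;
    return {
        n: n,
        mean: a.mean + delta * b.n / n,
        M2: a.M2 + b.M2 + delta * delta * a.n * b.n / n
    };
}

var older = summarize([1, 2, 3]);    // stands in for data that could be deleted
var newer = summarize([4, 5, 6, 7]);
var both = combine(older, newer);
var variance = both.M2 / both.n;     // population variance over all 7 values
```

<p>As long as the three numbers per batch are kept somewhere, the raw observations behind them are no longer needed to get the overall mean and variance.</p>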
<p>Perhaps the best approach is to maintain couchdb for just a day&#8217;s worth of data, and run a separate postgresql process to store the map reduce output.  Then as couchdb matures, I can eventually store longer and longer time periods, but at all times I have a record of past history.</p>
<p>I think a table keyed on the timestamp (rounded to 5 minutes) and the loop id, holding the mean, variance, and count values for all of the different risk predictions, would be good.  This would then feed higher level aggregation tables (like day, year, and so on).  By keeping the 5 minute mean, variance, and count, I can compute any other variance pretty quickly (average across all loops, average for that day, average for a year of that loop and 5 minute period, etc).</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/03/10/time-and-space/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>A lot of data is a lot of data</title>
		<link>http://contourline.wordpress.com/2009/03/06/a-lot-of-data-is-a-lot-of-data/</link>
		<comments>http://contourline.wordpress.com/2009/03/06/a-lot-of-data-is-a-lot-of-data/#comments</comments>
		<pubDate>Fri, 06 Mar 2009 18:35:42 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=150</guid>
		<description><![CDATA[I can&#8217;t seem to get an efficient setup going for storing loop data in couchdb.  On the surface it seems pretty simple&#8212;every loop is independent of every other loop, so every observation can be a document.  But for this application this is more limiting than I first thought.  The problem is that after storing just [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I can&#8217;t seem to get an efficient setup going for storing loop data in couchdb.  On the surface it seems pretty simple&#8212;every loop is independent of every other loop, so every observation can be a document.  But for this application this is more limiting than I first thought.  The problem is that after storing just a few days worth of data, the single couchdb database expands to 35GB.  I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.</p>
<p><span id="more-150"></span>So one db doesn&#8217;t work so well.  There was a posting to the user mailing list asking how many databases people were using.  So I gave that a shot&#8212;trying one db per loop, with the same map/reduce.  The downside is that I can&#8217;t compute averages across loops as before, but that&#8217;s okay because I couldn&#8217;t get the view generation to finish at all.</p>
<p>But I still have problems.  The main couchdb process (beam in top) runs up to 80 or 90 percent CPU usage, and there are lots of javascript child processes split off, but the view computation is still super slow, even on just a few days of data (14 GB uncompressed, including view cache).</p>
<p>I&#8217;m thinking that the only way to avoid this problem is to keep updating the view with every insert into the database.  But I&#8217;m worried that will fall behind real time, let alone allow me to move backwards and process last year&#8217;s data too.  Without any super quantitative measures, it seems from my experience that if I get about a gigabyte behind the curve on computing the cached view, I can&#8217;t keep up&#8212;data loading goes too fast, and index processing never finishes.   Or I get mystery errors like:</p>
<pre>at /usr/lib/perl5/site_perl/5.8.8/i486-linux-thread-multi/Coro.pm line 419
[
  {
    Reason =&gt; "Connection timed out",
    Status =&gt; 599,
    URL =&gt; "http://localhost:5984/safetydb1204650/_view/riskstats2/All",
  },
  undef,
] at ping_couchdbs.pl line 61</pre>
<p>from my program that pings the views for all of the databases.</p>
<p>So maybe for now I need to go back to postgresql.  I do like the map reduce part of Couchdb, and I do like the unstructured doc format, but perhaps it isn&#8217;t so good for massive number crunching yet. But not being able to get a year of data in and a valid annual average out is kind of a show stopper.</p>
<p>But to be fair, I couldn&#8217;t do that in Postgresql either.  My use of Couchdb may help there, as I now have a document centric view of the data, rather than a relational view.  And before I get off the couch (sorry), I still need to look into alternate view servers.  Maybe I can make a view server in C that can run faster than the javascript map/reduce process.   But that will have to wait until next weekend probably while I finish other projects.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/03/06/a-lot-of-data-is-a-lot-of-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Truck traffic</title>
		<link>http://contourline.wordpress.com/2009/02/11/truck-traffic/</link>
		<comments>http://contourline.wordpress.com/2009/02/11/truck-traffic/#comments</comments>
		<pubDate>Thu, 12 Feb 2009 00:17:55 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[bikes]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=147</guid>
		<description><![CDATA[Started up a new project recently to estimate traffic flows.  Our first task is to extract truck traffic estimates from those flow estimates.  For something that costs so much money and is such a large part of the economy, it always surprises me how little accurate information is collected about traffic.  While freeways have reasonable coverage [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Started up a new project recently to estimate traffic flows.  Our first task is to extract truck traffic estimates from those flow estimates.  <span id="more-147"></span>For something that costs so much money and is such a large part of the economy, it always surprises me how little accurate information is collected about traffic.  While freeways have reasonable coverage in California, streets are, for the most part, only monitored using periodic census counts.  Doing better than that costs lots of money, and collecting better data probably won&#8217;t make things a heck of a lot better, and *will* expose more about people&#8217;s trips (speed, origin, destination), so nothing will likely happen.</p>
<p>Also, while I&#8217;m not at all trolling for blog comments, I find it funny that people get worked up about automatic speed traps and so on.  Speed limits are usually set at prevailing, safe speeds.  If people are exceeding the speed limit, they are breaking the law and should be ticketed.  If they have a problem with that, they should lobby their elected representatives to get the speed limits raised.   As a bicycler, I&#8217;m usually put in the most danger not by fast moving cars, but rather by people who are exceeding the speed limit.  Those men and women are moving too fast given the conditions on the road to see and react to my bike.  I cycle pretty defensively, so it hasn&#8217;t been a problem yet.  Biking on a road with a fast speed limit usually isn&#8217;t a problem because that means the road is wide and sight lines are good.</p>
<p>I&#8217;m not saying that it isn&#8217;t freaky to bicycle on a road with a speed limit of 50mph.  I&#8217;m just saying that it is much safer than riding on a quiet side street by my house (speed limits 25 to 30mph) and having some soccer mom come roaring along at 45mph.</p>
<p>I say track all traffic via GPS in license plates, and mail out speeding tickets on an annual basis when you get your license renewed.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/02/11/truck-traffic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>More press for Autonet, this time from Calit2</title>
		<link>http://contourline.wordpress.com/2009/01/28/more-press-for-autonet-this-time-from-calit2/</link>
		<comments>http://contourline.wordpress.com/2009/01/28/more-press-for-autonet-this-time-from-calit2/#comments</comments>
		<pubDate>Wed, 28 Jan 2009 23:29:03 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=142</guid>
		<description><![CDATA[A very nice article on Autonet by Anna Lynn from Calit2 just got posted up today.   I guess if we want to get more funding on this project, the time to strike is now.]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>A <a href="http://www.calit2.net/newsroom/article.php?id=1455" target="_blank">very nice article</a> on Autonet by Anna Lynn from Calit2 just got posted up today.   I guess if we want to get more funding on this project, the time to strike is now.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/01/28/more-press-for-autonet-this-time-from-calit2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Trevor&#8217;s Autonet paper published</title>
		<link>http://contourline.wordpress.com/2009/01/14/trevors-autonet-paper-published/</link>
		<comments>http://contourline.wordpress.com/2009/01/14/trevors-autonet-paper-published/#comments</comments>
		<pubDate>Thu, 15 Jan 2009 00:53:29 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=134</guid>
		<description><![CDATA[Trevor&#8217;s Autonet paper finally got published, and we&#8217;ve gotten a small bit of press.  Funny how that works.  Do research and build a prototype.  Write a paper or two or four, apparently get no interest.  Project mostly trickles off.  Then one paper finally gets published by a slower journal, and hey, everybody is interested.
While the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Trevor&#8217;s Autonet paper finally got published, and we&#8217;ve gotten a <a href="http://tech.slashdot.org/article.pl?sid=09/01/09/2248213" target="_blank">small</a> <a href="http://www.physorg.com/news150543157.html" target="_blank">bit</a> <a href="http://www.networkworld.com/news/2009/010709-p2p-traffic-control.html">of</a> <a href="http://www.sciencecentric.com/news/article.php?q=09010743-wireless-technology-could-reduce-congestion-accidents" target="_blank">press</a>.  Funny how that works.  Do research and build a prototype.  Write a paper or two or four, apparently get no interest.  Project mostly trickles off.  Then one paper finally gets published by a slower journal, and hey, everybody is interested.</p>
<p>While the ideas are good, and while Trevor and his team did a great job with the prototype and got a working system running, I think the real barrier to something like Autonet taking off is the difficulty in getting  a local area wireless connection up and running.  Not from a technical, bit/bytes/hand-off/Doppler-shift point of view.  Rather from a non-technical user&#8217;s point of view.  It is quite difficult to set up a device so that it both blabs and listens on some open wireless channel without requiring careful attention from the user.  Most wifi links, in contrast, are pretty simple to use because there is a defined server and client. But even then most dialogs ask the user to select which host to access, and some require some sort of password or access code.</p>
<p>In the intervening years between working on that stuff and where we are now, we&#8217;ve sort of come to the conclusion that the data channel isn&#8217;t as important as just freeing the information from the automobile.  From the person traveling, really.</p>
<p>The primary advantage of a local area wireless connection is that, well, those cars and devices you can talk to probably have data that are relevant to you too, because you&#8217;re all sitting in the same spot.  The local area wireless link acts like a spatial query on the huge mountain of traffic data that is available.  The disadvantage is the need to configure your wireless device in a secure, user friendly way, and needing to develop some sort of protocol to query distant locations.</p>
<p>On the other hand, a cellular link does not have automatic spatial query on the data.  Of course you can *do* a spatial query, but that costs some cpu cycles, whereas with the Autonet idea, you&#8217;re *only* querying geographically proximate neighbors.  You&#8217;ve also got the problem that the wide area wireless links cost money to use.  Cellphone companies are known to charge outrageous rates for data transfer, and in fact, AT&amp;T specifically forbids using their data connection in the manner in which we would *like* to use it.  To quote from their service agreement terms and conditions:</p>
<p style="padding-left:30px;"><strong>Prohibited and Permissible Uses</strong>: Except as may otherwise be specifically permitted or prohibited for select data plans, data sessions may be conducted only for the following purposes: (i) Internet browsing; (ii) email; and (iii) intranet access. &#8230;[T]here are certain uses that cause extreme network capacity issues and interference with the network and are therefore prohibited. Examples of prohibited uses include, without limitation, the following: (i) server devices or host computer applications, including, but not limited to, Web camera posts or broadcasts, automatic data feeds, automated machine-to-machine connections or peer-to-peer (P2P) file sharing; &#8230;</p>
<p>So, an app that automatically uploads location and speed and queries traffic conditions every few seconds is out, but an application that &#8220;browses the internet&#8221; is okay.   So an application that responds to user input to &#8220;browse&#8221; the internet with a heartbeat ping is probably okay, but making it a daemon that bleeps every few minutes is not.</p>
<p>Gotta get us some iPhones so we can test this stuff out, I guess.  Which means we have to get funding.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/01/14/trevors-autonet-paper-published/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>my first non-trivial reduce for couchdb</title>
		<link>http://contourline.wordpress.com/2009/01/14/my-first-non-trival-reduce-for-couchdb/</link>
		<comments>http://contourline.wordpress.com/2009/01/14/my-first-non-trival-reduce-for-couchdb/#comments</comments>
		<pubDate>Thu, 15 Jan 2009 00:49:45 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=137</guid>
		<description><![CDATA[Update:  I posted up a cleaner version of this to the CouchDB wiki at http://wiki.apache.org/couchdb/View_Snippets
So.  I need to compute the standard deviation.  I didn&#8217;t trust jchris&#8217; couchdb reduce example, so I decided to dig through google and find (again) the accepted on-line way to compute standard deviation (and other moments).
All in all  a pretty interesting [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Update:  I posted up a cleaner version of this to the CouchDB wiki at <a href="http://wiki.apache.org/couchdb/View_Snippets" target="_blank">http://wiki.apache.org/couchdb/View_Snippets</a></p>
<p>So.  I need to compute the standard deviation.  I didn&#8217;t trust jchris&#8217; <a href="http://github.com/jchris/couchdb-reduce-example/tree/master" target="_blank">couchdb reduce example</a>, so I decided to dig through google and find (again) the accepted on-line way to compute standard deviation (and other moments).</p>
<p><span id="more-137"></span>All in all  a pretty interesting search.  There is a great, free programming book available from MIT.  All about lists.  That didn&#8217;t help.  Then I found the above referenced example in the couchdb mailing lists, as well as another that pointed to a java library.  I looked at that, and noted that it cited Knuth&#8217;s art of computer programming.  So that was good.  Then I did another google search and eventually looked at the Wikipedia entry (which was typically pretty sloppy) but which atypically had a decent reference or two.  Both of those references were on-line too, so I eventually worked up two algorithms to compute the second moment.</p>
<p>I did two, because the output should match, no?  And it does.  So they must be correct!  (Good old CRA data management strategies coming into play there.)</p>
<p>First the raw reduce code, cut from the Futon window:</p>
<pre>function (keys, values, rereduce) {

    // algorithm for on-line computation of moments from
//
//    Tony F. Chan, Gene H. Golub, and Randall J. LeVeque: "Updating
//    Formulae and a Pairwise Algorithm for Computing Sample
//    Variances." Technical Report STAN-CS-79-773, Department of
//    Computer Science, Stanford University, November 1979.
    // url: ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
    // so there is some weirdness in that the original was Fortran, index from 1,
    // and lots of arrays (no lists, no hash tables)

    // also consulted http://people.xiph.org/~tterribe/notes/homs.html
    // and http://www.jstor.org/stable/2683386
    // and (ick!) the wikipedia description of Knuth's algorithm
    // to clarify what was going on with http://www.slamb.org/svn/repos/trunk/projects/common/src/java/org/slamb/common/stats/Sample.java

    function combine_S(current,existing,key){
	if(!key){key='risk';}
	var NS=current.S;
	var NSum=current.Sum;
	var M = existing.M;
	if(!M){M=0;}
	if(M&gt;0){
	    var diff =
		((current.M * existing.Sum / existing.M) - current.Sum );

	    NS += existing.S + existing.M*diff*diff/(current.M * (current.M+existing.M) );
	    NSum += existing.Sum ;
	}
	return {'S':NS,'Sum':NSum, 'M': current.M+M };
    }

    function pairwise_update (values, M, Sum, S, key){
	if(!key){key='risk';}
	if(!Sum){Sum = 0; S = 0; M=0;}
	if(!S){Sum = 0; S = 0; M=0;}
	if(!M){Sum = 0; S = 0; M=0;}
	var T;
	var stack_ptr=1;
	var N = values.length;
	var half = Math.floor(N/2);
	var NSum;
	var NS ;
	var SumA=[];
	var SA=[];
	var Terms=[];
	Terms[0]=0;
	if(N == 1){
	    // a single value: the sum is the value, no spread, one term
	    NSum=values[0][key];
	    NS=0;
	    T=1;
	}else if(N &gt; 1){
	    // loop over the data pairwise
	    for(var i = 0; i &lt; half; i++){
		SumA[stack_ptr]=values[2*i+1][key] + values[2*i][key];
		var diff = values[2*i + 1][key] - values[2*i][key] ;
		SA[stack_ptr]=( diff * diff ) / 2;
		Terms[stack_ptr]=2;
		while( Terms[stack_ptr] == Terms[stack_ptr-1]){
		    // combine the top two elements in storage, as
		    // they have equal numbers of support terms.  this
		    // should happen for powers of two (2, 4, 8, etc).
		    // Everything else gets cleaned up below
		    stack_ptr--;
		    Terms[stack_ptr]*=2;
		    var diff = SumA[stack_ptr] - SumA[stack_ptr+1];
		    SA[stack_ptr]=  SA[stack_ptr] + SA[stack_ptr+1] +
			(diff * diff)/Terms[stack_ptr];
		    SumA[stack_ptr] += SumA[stack_ptr+1];
		} // repeat as needed
		stack_ptr++;
	    }
	    stack_ptr--;
	    // check if N is odd
	    if(N % 2 !=  0){
		// handle that dangling element
		stack_ptr++;
		Terms[stack_ptr]=1;
		SumA[stack_ptr]=values[N-1][key];
		SA[stack_ptr]=0;
	    }
	    T=Terms[stack_ptr];
	    NSum=SumA[stack_ptr];
	    NS= SA[stack_ptr];
	    if(stack_ptr &gt; 1){
		// values.length is not power of two, handle remainders
		for(var i = stack_ptr-1; i&gt;=1 ; i--){
		    var diff = Terms[i]*NSum/T-SumA[i];
		    NS = NS + SA[i] +
			( T * diff * diff )/
			(Terms[i] * (Terms[i] + T));
		    NSum += SumA[i];
		    T += Terms[i];
		}
	    }
	}
	// finally, combine NS and NSum with S and Sum
	return 	combine_S(
	    {'S':NS,'Sum':NSum, 'M': T },
	    {'S':S,'Sum':Sum, 'M': M });
    }

    var output={};
    if(!rereduce)
    {
	output = pairwise_update(values);
      var mean = values[0].risk;
      var min = values[0].risk;
      var max = values[0].risk;
      var M2 = 0;
	for(var i=1 ; i&lt;values.length; i++){
	    var diff = (values[i].risk - mean);
            var newmean = mean +  diff / (i+1);
            M2 += diff * (values[i].risk - newmean);
            mean = newmean;
            min = Math.min(values[i].risk, min);
            max = Math.max(values[i].risk, max);
        }

	output.min=min;
	output.max=max;
	output.mean=mean;
	output.M2=M2;
	output.variance_n=M2/values.length;
	output.variance_nOtherWay=output.S/output.M;
	output.mean_OtherWay = output.Sum/output.M;
        output.n = values.length;

    } else {
	/*
           we have an existing pass, so should have multiple outputs to combine
        */
       var mean = 0;
       var min = Infinity;
       var max = -Infinity;
       var M2 = 0;
       var n = 0;
	for(var v in values){
	    output = combine_S(values[v],output);
	    // merge this partial result into the running n, mean and M2
	    var newn = n + values[v].n;
	    min = Math.min(values[v].min, min);
	    max = Math.max(values[v].max, max);
	    var diff = values[v].mean - mean;
	    var newmean = mean + diff*(values[v].n/newn);
	    M2 += values[v].M2 + (diff * diff * n * values[v].n / newn );
	    n=newn;
	    mean=newmean;
	}
	output.min=min;
	output.max=max;
	output.mean=mean;
	output.M2=M2;
	output.variance_n=M2/n;
	output.variance_nOtherWay=output.S/output.M;
	output.mean_OtherWay = output.Sum/output.M;
	output.n = n;

    }
    // and done
    return output;
}</pre>
<p>And there you have it.  The input, as you might be able to tell from the code, is an object with &#8220;risk&#8221; as the pertinent bit of information being reduced.  The above code is very difficult to read, but I wanted to get it up now in case I never get back to it.</p>
<p>My goal is to split the code into its two algorithms, time each of them on really large sets of data, then keep the better one and delete the other.  Another reason to get it up somewhere.</p>
<p>I like the non-Knuth version, even though it is pretty tortured in its use of stacks and so on, because the authors go on at length about how much better it is to use their pairwise algorithm, both numerically and from a storage point of view.  It also lends itself to tacking on the other code for computing higher-order moments.  But I think I will keep the min/max stuff from the java code, as it could be useful.</p>
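<p>For reference, here is a minimal sketch of the pairwise merge step on its own, written to match the update in the remainder loop of the reduce above.  The name mergePartials and the sample numbers are made up for illustration; this is not the combine_S function used in the reduce, just the same formula applied to two partial results of the form {S, Sum, M}.</p>
<pre>// Hypothetical helper (an assumption, not the combine_S from the reduce):
// merge two partial summaries {S, Sum, M}, where S is the sum of squared
// deviations, Sum is the plain sum, and M is the number of terms.
function mergePartials(a, b) {
    // an empty side contributes nothing
    if (a.M === 0) return b;
    if (b.M === 0) return a;
    // same form as the remainder loop: diff = Terms[i]*NSum/T - SumA[i]
    var diff = (b.M * a.Sum) / a.M - b.Sum;
    return {
        S: a.S + b.S + (a.M * diff * diff) / (b.M * (b.M + a.M)),
        Sum: a.Sum + b.Sum,
        M: a.M + b.M
    };
}

// made-up partial results
var left = { S: 0.5, Sum: 3, M: 2 };   // summary of [1, 2]
var right = { S: 5, Sum: 18, M: 4 };   // summary of [3, 4, 5, 6]
var all = mergePartials(left, right);
console.log(all.S, all.Sum, all.M);    // 17.5 21 6
console.log(all.S / all.M);            // variance_n of [1, 2, 3, 4, 5, 6]</pre>
<p>The merged S of 17.5 is what a direct pass over the six values gives (mean 3.5), which is the property that lets the partial sums be combined in any grouping.</p>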
<p>Oh, to close a loophole, the reason I didn&#8217;t trust the code from the couchdb-reduce-example is that I didn&#8217;t immediately see what I was looking for.  On closer inspection, it probably does the same thing as my code.  Hmm, yes, for completeness, I am going to plug in this third way and check whether it gives the same results.  And indeed it does:</p>
<pre>{"S": 1276.8988123975391, "Sum": 1257.4497350063907, "M": 955, "min":
0.033031734767263086, "max": 6.011336961717487, "mean":
1.3167012932004087, "M2": 1276.898812397539, "variance_n":
1.3370668192644386, "variance_nOtherWay": 1.3370668192644388,
"mean_OtherWay": 1.3167012932004092, "n": 955, "stdDeviation":
1.1563160550923948, "count": 955, "total": 1257.4497350063905,
"sqrTotal": 2932.5845046149643}</pre>
<p>As long as you believe that sqrt(1.3370668192644386) is about 1.1563160550923948.  Oh, and on rereading the rationale behind the Chan, Golub, and LeVeque paper, I see they state:</p>
<p style="padding-left:30px;">This problem is sometimes avoided by use of the following textbook algorithm, so called because, unfortunately, it is often suggested in statistical textbooks:</p>
<p style="padding-left:60px;">S = Sum(x**2) - (1/N) * (Sum(x))**2</p>
<p style="padding-left:60px;">[[ the equation used in the github example ]]</p>
<p style="padding-left:30px;">This rearrangement allows S to be computed with only one pass through the data, but the computation may be numerically unstable and should almost never be used in practice. This instability is particularly troublesome when S is very small.</p>
<p>For the record, I did not notice any problems so far on my test data, but I certainly have run into numerical stability problems in other cases on the more general data set, so I will probably stick with the other algorithms.</p>
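<p>To make the instability concrete, here is a small sketch comparing the textbook formula quoted above with a running update of the same form as the mean/M2 loop in the reduce.  The function names and the data are made up for illustration: four values with a small spread around a large offset, chosen so the exact answer (90) is easy to work out by hand.</p>
<pre>// Textbook one-pass formula: S = Sum(x**2) - (1/N) * (Sum(x))**2
function textbookS(xs) {
    var sum = 0;
    var sumSq = 0;
    for (var i in xs) {
        sum += xs[i];
        sumSq += xs[i] * xs[i];
    }
    return sumSq - (sum * sum) / xs.length;
}

// Running update of mean and M2, same form as the loop in the reduce
function runningS(xs) {
    var mean = 0;
    var M2 = 0;
    var n = 0;
    for (var i in xs) {
        n += 1;
        var diff = xs[i] - mean;
        mean += diff / n;
        M2 += diff * (xs[i] - mean);
    }
    return M2;
}

// made-up data: mean is 1e9 + 10, deviations are -6, -3, 3, 6
var xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16];
console.log(runningS(xs));    // 90, the exact sum of squared deviations
console.log(textbookS(xs));   // not 90: the squares are too big for a double</pre>
<p>Each squared value here is around 1e18, well past the point where a double can represent every integer, so the subtraction in the textbook formula is working with already-rounded numbers.  On data with a small offset the two functions agree, which is consistent with my test data not showing any problems.</p>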
  </div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/01/14/my-first-non-trival-reduce-for-couchdb/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
	</channel>
</rss>