<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Contour Line &#187; couchdb</title>
	<atom:link href="http://contourline.wordpress.com/category/couchdb/feed/" rel="self" type="application/rss+xml" />
	<link>http://contourline.wordpress.com</link>
	<description>Surround and define the edges of a subject, giving it shape and volume</description>
	<lastBuildDate>Fri, 13 Nov 2009 17:45:35 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='contourline.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/46bd6fbf3e12066a454c58d20b938584?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Contour Line &#187; couchdb</title>
		<link>http://contourline.wordpress.com</link>
	</image>
			<item>
		<title>RJSONIO to process CouchDB output</title>
		<link>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/</link>
		<comments>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 22:23:03 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=223</guid>
		<description><![CDATA[I have an idea.  I am going to process the 5 minute aggregates of raw detector data I&#8217;ve stored in monthly CouchDB databases using R via RCurl and RJSONIO.  So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, and then use RJSONIO to [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I have an idea.  I am going to process the 5 minute aggregates of raw detector data I&#8217;ve stored in monthly CouchDB databases using R via RCurl and RJSONIO.  So, even though my data is physically split into months, I can use RCurl to pull from each of the databases, use RJSONIO to parse the JSON, and then use bootstrap methods to estimate the expected value and confidence bounds and, perhaps more importantly, to identify outliers and unusual events.   <span id="more-223"></span>   </p>
<p>Update: this works great.  Except it reveals that my JSON structure in CouchDB isn&#8217;t so great.  The problem is that I&#8217;m emitting one JSON object per row.  For example:</p>
<pre><strong><code> ["1201044", "00:00:00", "Fri", "12"]:{N:8,O:0.001782, Pct:1, lanes: 5, intrvls: 10}</code></strong></pre>
<p>While that looks great on paper, and logically makes sense if you think about pulling a single record, it doesn&#8217;t work so well when you process lots of records.  While RJSONIO is pretty darn good, it certainly isn&#8217;t a mind reader, and it cannot turn a list of such objects into a matrix or data frame without some help.  If you just throw the results of the RCurl fetch at RJSONIO, you get the following:</p>
<pre><code>&gt; demo=fromJSON(data)
&gt; demo$rows[1]
[[1]]
[[1]]$key
[1] "1202024"  "17:35:00" "Fri"      "12"

[[1]]$value
[[1]]$value$N
[1] 427

[[1]]$value$O
[1] 0.04861833

[[1]]$value$Pct
[1] 1

[[1]]$value$lanes
[1] 6

[[1]]$value$intrvls
[1] 10
</code></pre>
<p>In words, the CouchDB response of <code>{rows:[...]}</code> is parsed by R as a named list.  The response is a list with one element, <code>rows</code>, which contains <code>n</code> elements.  Each of those has an element <code>key</code>, which is a character vector, and another element <code>value</code>, which is itself a list containing several named elements: <code>N, O, Pct, lanes, intrvls</code>.  I couldn&#8217;t figure out a quick way to make R figure out that I wanted a <code>data.frame</code> with named entries for each of the key terms and each of the value terms (9 columns by n rows).  Many more gray hairs later, I remembered <code>unlist</code> and got stuff sorted.  Here is my suboptimal R script for the next time I take a long break from using R and can&#8217;t remember the syntax anymore.</p>
<pre><code>
## needs the RCurl and RJSONIO packages
library(RCurl)
library(RJSONIO)
#parameters: month,id,fivemin
id=1202024  ## randomly chosen
fivemin="17:35"
# get every month in parallel.  RCurl is cool that way
month=c("01","02","03","04","05","06","07","08","09","10","11","12")
couchdb = "http://localhost:5984/"
db = paste("d12_2007_",month,"morehash/_design/summary/_view/fivemin?",sep="")
moreurl = paste("group=true&amp;startkey=[\"",id,"\",\"",fivemin,":00\"]&amp;endkey=[\"",id,"\",\"",fivemin,":01\"]",sep="")
uri=paste(couchdb,db,moreurl,sep="")  ## 12 different URIs to fetch
data = getURL(uri)
## make a list to store data temporarily on the first pass
d1=list()
## seq_along is safe even if the fetch returned nothing
for(i in seq_along(data)){
  ## parse each month in turn
  jsondata = fromJSON(data[[i]])
  ## unlist flattens the R object
  d1[[i]]=unlist(jsondata$rows)
}
## make the list of flattened R objects into a matrix
## by unlisting again, and specifying that I'm expecting 9 columns
dmatrix = matrix(data=unlist(d1),ncol=9,byrow=TRUE)
## finally, make a dataframe explicitly labeling each column as needed and converting to numeric from text
d2= data.frame(id=dmatrix[,1],
                      tod=dmatrix[,2],
                      dow=dmatrix[,3],
                      dom=as.numeric(dmatrix[,4]),
                      N=as.numeric(dmatrix[,5]),
                      O=as.numeric(dmatrix[,6]),
                      pct=as.numeric(dmatrix[,7]),
                      lanes=as.numeric(dmatrix[,8]),
                      intervals=as.numeric(dmatrix[,9]))
</code>
</pre>
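<p>A minimal sketch of the same reshaping, written in Python rather than R purely for illustration (it is not part of the script above).  It pairs the four key parts and the five value fields by name instead of relying on the order that <code>unlist</code> produces.  The column names <code>id, tod, dow, dom</code> are taken from the data frame above, the function name <code>flatten_rows</code> is made up, and the sample response is built from the values printed in the R session earlier.</p>

```python
import json

# Column order assumed from the post: four key parts, then five value fields.
KEY_COLUMNS = ["id", "tod", "dow", "dom"]
VALUE_COLUMNS = ["N", "O", "Pct", "lanes", "intrvls"]


def flatten_rows(response_text):
    """Turn a CouchDB view response into a list of flat dicts, one per row."""
    rows = json.loads(response_text)["rows"]
    records = []
    for row in rows:
        # Name the key parts by position, then copy the value fields by name.
        record = dict(zip(KEY_COLUMNS, row["key"]))
        for name in VALUE_COLUMNS:
            record[name] = row["value"][name]
        records.append(record)
    return records


# Sample response built from the values shown in the R output above.
sample = json.dumps({"rows": [{
    "key": ["1202024", "17:35:00", "Fri", "12"],
    "value": {"N": 427, "O": 0.04861833, "Pct": 1, "lanes": 6, "intrvls": 10},
}]})

records = flatten_rows(sample)
print(records[0])
```

<p>Looking fields up by name means a row with a missing value field fails loudly with a <code>KeyError</code> rather than silently shifting columns, which is the risk with the 9-column matrix approach.</p>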
<p>Next up is the actual bootstrapping of interesting statistics.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Tokyo Tyrant Throwing a Tantrum</title>
		<link>http://contourline.wordpress.com/2009/11/02/tokyo-tyrant-throwing-a-tantrum/</link>
		<comments>http://contourline.wordpress.com/2009/11/02/tokyo-tyrant-throwing-a-tantrum/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 05:32:04 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=220</guid>
		<description><![CDATA[Well, last Friday I posted &#8220;So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.&#8221;
It didn&#8217;t.  Actually I checked later that same day and all of my jobs had died due to recv errors.  I&#8217;ve tried lots of hacky things but nothing seems to do the trick.  [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Well, last Friday I posted &#8220;So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.&#8221;</p>
<p>It didn&#8217;t.  Actually I checked later that same day and all of my jobs had died due to recv errors.  I&#8217;ve tried lots of hacky things but nothing seems to do the trick.  From some Google searching, it seems that perhaps it is a timeout issue, but I can&#8217;t see how to modify the Perl library to allow for a longer timeout.</p>
<p>So, I wrote a little hack to stop writing for 5 seconds, make a new connection, and go on writing.  Now it only crashes out of the loop if that new connection also fails to write.  And I also don&#8217;t crash until I save my place in the CSV file, so I don&#8217;t repeat myself.  So I&#8217;m not getting a complete failure, but it is still super slow.</p>
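<p>The pause, reconnect, and checkpoint logic described above can be sketched roughly as follows.  This is in Python, not the Perl of the actual loader, and every name here (<code>load_records</code>, <code>WriteFailed</code>, <code>FlakyStore</code>, the <code>connect</code> and <code>checkpoint</code> callables) is hypothetical; a fake in-memory store stands in for the Tokyo Tyrant connection.</p>

```python
import time


class WriteFailed(Exception):
    """Raised by the (hypothetical) store client when a write fails."""


def load_records(records, connect, checkpoint, start=0, pause_seconds=5.0):
    """Write records[start:], retrying each failed write once on a new connection.

    Returns the number of records written by this call. If the retry also
    fails, the current position is handed to checkpoint() before re-raising.
    """
    conn = connect()
    written = 0
    for position in range(start, len(records)):
        key, value = records[position]
        try:
            conn.put(key, value)
        except WriteFailed:
            # Back off, reconnect, and try this one record again.
            time.sleep(pause_seconds)
            conn = connect()
            try:
                conn.put(key, value)
            except WriteFailed:
                checkpoint(position)  # save our place before giving up
                raise
        written += 1
    return written


class FlakyStore:
    """In-memory stand-in for a remote store; every third put() fails."""

    def __init__(self):
        self.data = {}
        self.calls = 0

    def put(self, key, value):
        self.calls += 1
        if self.calls % 3 == 0:
            raise WriteFailed(key)
        self.data[key] = value


store = FlakyStore()
saved_positions = []
rows = [("k%d" % i, i) for i in range(10)]
# pause_seconds=0 keeps the demo fast; the post describes a 5 second pause.
written = load_records(rows, lambda: store, saved_positions.append, pause_seconds=0)
print(written, len(store.data), saved_positions)
```

<p>Passing the saved position back in as <code>start</code> on the next run is what avoids repeating rows that were already written.</p>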
<p>While the documentation for Tokyo Tyrant and Tokyo Cabinet is super great overall, it seems thin on use cases and examples for stuffing a lot of data into the table db at once.</p>
<p>Interesting, probably unrelated fact: the crashing only started when I recomputed my target bnum, and boosted it from 8 million to 480 million.</p>
<p>Anyway, I had time today to tweak the data load script, and also to finalize my CouchDB loading script.  I started two jobs for each, with Tokyo Tyrant started first, and it looks like CouchDB is going to finish first.  (The CouchDB January job is completing three days for every one in the Tokyo Tyrant job; the March jobs are closer together, but that Tyrant job started about an hour before everything else.)</p>
<p>I guess there is still a way for Tokyo Tyrant to win this race.  I am planning to set up a map/reduce type of view on my CouchDB datastore to collect hourly summaries of the data.  It might be that computing that view is slow, and that computing similar summaries on the Tokyo Cabinet table is faster.  We&#8217;ll see.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/11/02/tokyo-tyrant-throwing-a-tantrum/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Tokyo Tyrant is cool</title>
		<link>http://contourline.wordpress.com/2009/10/30/tokyo-tyrant-is-cool/</link>
		<comments>http://contourline.wordpress.com/2009/10/30/tokyo-tyrant-is-cool/#comments</comments>
		<pubDate>Sat, 31 Oct 2009 06:30:35 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=218</guid>
		<description><![CDATA[Just to have a recollection of this later, some notes.
setting up tokyo tyrant instances, one per month.  I expect about 4 million records a day, so that is 120 million a month, so I set bnum to 480 million, which seems insane, but worth a shot
One thing I noticed was that in shifting from one [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Just to have a recollection of this later, some notes.</p>
<p>Setting up Tokyo Tyrant instances, one per month.  I expect about 4 million records a day, so that is 120 million a month, so I set bnum to 480 million (four times the monthly record count), which seems insane, but worth a shot.</p>
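<p>The arithmetic behind those bnum values, as a tiny Python sketch.  The helper name <code>bucket_count</code> is made up, and the multipliers (2 for the one-day test, 4 for the monthly load) and the 30-day month are simply the ones implied by the numbers in this post, not a recommendation.</p>

```python
def bucket_count(records_per_day, days, factor):
    """Bucket number as a multiple of the expected record count."""
    return records_per_day * days * factor


# About 4 million records a day, per the estimate in the post.
one_day = bucket_count(4_000_000, 1, 2)     # the one-day test setting
one_month = bucket_count(4_000_000, 30, 4)  # the monthly setting
print(one_day, one_month)
```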
<p>One thing I noticed: in shifting from one-day tests to a one-month populate, and with the bump up of bnum from 8 million (2 times 4 million) to 480 million, there is a significant speed drop when populating the data from four simultaneous processes (one for each of 4 months).</p>
<p>There is a write delay, of course, and that may be all of it, since the files are big now.</p>
<p>Perhaps there is a benefit from wider tables, rather than one row per data record?  Like one row per hour of data per sensor, or one row per 5 minutes, etc?</p>
<p>Also, as I wrapped up my initial one-day tests, I got some random crashes on my Perl script stuffing data in.  Not sure why.  Could be because I was tweaking parameters and stuff.</p>
<p>One final point: the size of one day of data in Tokyo Cabinet is about the same as the size of one day of data in CouchDB.  I was hoping to get a much bigger size advantage (smaller file).  The source data is about a 100M unzipped CSV file, and it balloons to 600M with bnum set at 8 million in a table database.  Of course, it isn&#8217;t strictly the same data&#8230; I am splitting the timestamp into parts so I can do more interesting queries without a lot of work (give me an average of data on Mondays in July; Tuesdays all year; 8 am to 9 am last Wednesday, etc.).</p>
<p>So, slotting 4 months of data away.  I&#8217;ll check it again on Monday and see if it worked.</p>
<p>And by the way, I&#8217;m sure I&#8217;m not the best at this because I haven&#8217;t used it much, but it is orders of magnitude faster to use the COPY command via DBIx::Class to load CSV data into PostgreSQL.  Of course, I don&#8217;t want to have all of that data sitting in my relational database, but I&#8217;m just saying&#8230;</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/10/30/tokyo-tyrant-is-cool/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Putting stuff away</title>
		<link>http://contourline.wordpress.com/2009/10/26/putting-stuff-away/</link>
		<comments>http://contourline.wordpress.com/2009/10/26/putting-stuff-away/#comments</comments>
		<pubDate>Mon, 26 Oct 2009 16:34:04 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=214</guid>
		<description><![CDATA[Started testing out TokyoCabinet and TokyoTyrant last Friday, and got my initial test program running this morning.  The documentation is pretty good, but I&#8217;m still floundering about a little bit.  Not sure what parameters to pass to the b+ tree database file to make it work well for my data; not sure how to set [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Started testing out <a href="http://1978th.net/tokyocabinet/" target="_blank">TokyoCabinet</a> and <a href="http://1978th.net/tokyotyrant/" target="_blank">TokyoTyrant</a> last Friday, and got my initial test program running this morning.  The documentation is pretty good, but I&#8217;m still floundering about a little bit.  Not sure what parameters to pass to the B+ tree database file to make it work well for my data; not sure how to set up multiple databases for sharding; etc.  On the plus side, my Perl code that loads the data is running at about 50% CPU, so it is doing something rather than waiting around for writes.  On the down side, now I have to write a small program to check on the progress of those writes to make sure that I am actually writing something!</p>
<p>Update.  I am comparing storing in TokyoTyrant with storing in CouchDB.  CouchDB, it turns out, is faster for me out of the box because of the way Erlang takes advantage of the multi-core processor.  The Tokyo Tyrant server just maxes out one core, and so my loading programs wait around for the server to process the data.  CouchDB, on the other hand, will use up lots more cores (I&#8217;ve seen the process go to about 400% in top).  So, loading a year of data with one data-reading process per month running simultaneously, TokyoTyrant is only up to day 6 of each month, while my CouchDB loader programs are all up to about day 14 in each month.</p>
<p>I&#8217;m sure there is a way to set up TokyoTyrant to use multiple CPUs, but I can&#8217;t find it yet.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/10/26/putting-stuff-away/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Swinger is cool, Sammy looks cooler</title>
		<link>http://contourline.wordpress.com/2009/09/30/swinger-is-cool-sammy-looks-cooler/</link>
		<comments>http://contourline.wordpress.com/2009/09/30/swinger-is-cool-sammy-looks-cooler/#comments</comments>
		<pubDate>Thu, 01 Oct 2009 06:02:57 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[sakai]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=196</guid>
		<description><![CDATA[Just tried out swinger.  It is cool.  But I can&#8217;t get authorization to work right using the trunk checkout of couch (0.11.blahblah_git).  Something to hack on
But I&#8217;m more interested in playing with Sammy.js.  The two application stack figures on the blog page (and in the Swinger slides) are interesting.  Take away the couchdb bit, add [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Just tried out <a href="http://github.com/quirkey/swinger" target="_blank">swinger</a>.  It is cool.  But I can&#8217;t get authorization to work right using the trunk checkout of couch (0.11.blahblah_git).  Something to hack on.</p>
<p>But I&#8217;m more interested in playing with <a href="http://github.com/quirkey/sammy" target="_blank">Sammy.js</a>.  The two application stack figures on the <a href="http://www.quirkey.com/blog/2009/09/15/sammy-js-couchdb-and-the-new-web-architecture/" target="_blank">blog page</a> (and in the Swinger slides) are interesting.  Take away the CouchDB bit, add Sakai&#8217;s K2, and you&#8217;ve got a very similar picture.  Sure, CouchDB can serve the app with attachments to the _design doc, but that&#8217;s not the point.  The point is being able to stick documents into a db and then get them out again in interesting ways without having to bend over backwards on the server side.</p>
<p>But again, I have to play with it for a while and see what it can do.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/09/30/swinger-is-cool-sammy-looks-cooler/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Cow Chap</title>
		<link>http://contourline.wordpress.com/2009/04/28/cow-chap/</link>
		<comments>http://contourline.wordpress.com/2009/04/28/cow-chap/#comments</comments>
		<pubDate>Wed, 29 Apr 2009 05:37:42 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[couchdb]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=171</guid>
		<description><![CDATA[Digesting Couchapp.
I ran through the documentation at http://wiki.github.com/jchris/couchapp/manual, and set up a test site.  I looked through the test, and saw a buncha stuff I didn&#8217;t write.  I like that and don&#8217;t like that.  I find app builders lead to cruft laying around&#8212;like I noticed jquery 1.2.6 when the latest is 1.3.2 if I remember [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Digesting Couchapp.<span id="more-171"></span></p>
<p>I ran through the documentation at http://wiki.github.com/jchris/couchapp/manual, and set up a test site.  I looked through the test, and saw a bunch of stuff I didn&#8217;t write.  I like that and don&#8217;t like that.  I find app builders lead to cruft lying around; for instance, I noticed jQuery 1.2.6 when the latest is 1.3.2, if I remember correctly.  And from the documentation on the wiki, I didn&#8217;t really understand what all the files were.  Of course, with no data in the DB, I expected a no-op application, but I couldn&#8217;t even see *why* all that stuff was there in the lib and vendor and so on.</p>
<p>Then I finally hit upon the actual README at http://github.com/jchris/couchapp/tree/master.  Perhaps it is just years of reading textbooks, but I found this page to be much more helpful.  I think I get it.  All the library stuff gets shoved into CouchDB as part of the application.  Then the couchapp glue uses macros to leverage the libraries.  Kinda like lots of other programming languages do it, but probably closest to how most HTML templating languages work.  You follow the template construct, for example,</p>
<pre><code>// !json lib.templates.post
</code></pre>
<p>and the output is expanded according to what is found at lib.templates.post.  In the test app case, I don&#8217;t have a lib.templates.post, but I do see a lib.templates.example that shows an html document.  And of course, the thing at the end of the macro rainbow can be useful too, with javascript and queries to the db, etc etc.</p>
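<p>To make the macro idea concrete, here is a toy expander in Python.  It is only an illustration of the general technique (find a <code>// !json path</code> comment, look the dotted path up in the app&#8217;s tree, and substitute the content); it is not couchapp&#8217;s actual implementation, and the <code>var name = ...;</code> output shape, the function names <code>lookup</code> and <code>expand</code>, and the sample template text are all assumptions made for the example.</p>

```python
import json
import re

# Matches a whole-line comment of the form:  // !json some.dotted.path
MACRO = re.compile(r"^\s*//\s*!json\s+([\w.]+)\s*$")


def lookup(tree, dotted_path):
    """Walk a nested dict following a dotted path such as lib.templates.example."""
    node = tree
    for part in dotted_path.split("."):
        node = node[part]
    return node


def expand(source, tree):
    """Replace each macro line with a variable holding the referenced content."""
    out = []
    for line in source.splitlines():
        match = MACRO.match(line)
        if match:
            path = match.group(1)
            name = path.split(".")[-1]
            out.append("var %s = %s;" % (name, json.dumps(lookup(tree, path))))
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical app tree with a single template, mirroring lib.templates.example.
app = {"lib": {"templates": {"example": "Hello, {{name}}"}}}
source = "// !json lib.templates.example\nfunction(doc) { return example; }"
expanded = expand(source, app)
print(expanded)
```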
<p>Apparently these macros can be used for views (map, and I assume reduce), and it says lists and shows, but I am not yet familiar with those two constructs.</p>
<p>As I said at the start, lots to digest.</p>
<p>And as a quick update, lists and shows are cool too.  Very similar to how I process JSON now in JavaScript, but on the server side, and allowing a bit more flexibility.  I&#8217;m thinking of pulling filenames from CouchDB based on metadata in the doc, and mapping those to actual image files.  I still think I should be wrapping my couch apps in Perl or Java, but there is less and less work for the wrapper to do as this project matures.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/04/28/cow-chap/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Close still doesn&#8217;t count &#8230;</title>
		<link>http://contourline.wordpress.com/2009/04/10/close-still-doesnt-count/</link>
		<comments>http://contourline.wordpress.com/2009/04/10/close-still-doesnt-count/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 23:13:13 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=166</guid>
		<description><![CDATA[&#8230; except for nukes and bocci.
I can *almost* make bootstrapping work, but not entirely within couchdb.  I am going to have to do external processing.  Which is probably fine.  
Anyway, here&#8217;s where I am so far.  I am loading up one database per detector, with documents that look like:
{
   "_id": "40130160",
   [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>&#8230; except for nukes and bocci.</p>
<p>I can *almost* make bootstrapping work, but not entirely within couchdb.  I am going to have to do external processing.  Which is probably fine.  <span id="more-166"></span></p>
<p>Anyway, here&#8217;s where I am so far.  I am loading up one database per detector, with documents that look like:</p>
<pre>{
   "<code class="key">_id</code>": <code class="string">"40130160"</code>,
   "<code class="key">_rev</code>": <code class="string">"1-1446962830"</code>,
   "<code class="key">CV_OCC_1</code>": <code class="string">"0.29665"</code>,
   "<code class="key">LAG1_OCC_1</code>": <code class="string">"-0.02416"</code>,
   "<code class="key">CORR_OCC_1M X LAG1_OCC_R</code>": <code class="string">"0.07549"</code>,
   "<code class="key">Severity--PDO</code>": <code class="string">"0.46376"</code>,
   "<code class="key">LAG1_OCC_R</code>": <code class="string">"0.15293"</code>,
   "<code class="key">left lane accident</code>": <code class="string">"0.055"</code>,
   "<code class="key">2 veh accident</code>": <code class="string">"0.22823"</code>,
   "<code class="key">vdsid</code>": <code class="string">"1203692"</code>,
   "<code class="key">SD_VOL_M</code>": <code class="string">"3.16066"</code>,
   "<code class="key">EstimateTime</code>": <code class="string">"2008-02-24T18:00:00-0800"</code>,
   "<code class="key">CORR_VOL_1R</code>": <code class="string">"0.45782"</code>,
   "<code class="key">month</code>": <code class="string">"02"</code>,
   "<code class="key">1 veh accident</code>": <code class="string">"0.0597"</code>,
   "<code class="key">day</code>": <code class="string">"Sun"</code>,
   "<code class="key">CV_OCC_R</code>": <code class="string">"0.47489"</code>,
   "<code class="key">CORR_VOL_1M</code>": <code class="string">"0.4268"</code>,
   "<code class="key">MU_VOL_M</code>": <code class="string">"13.60"</code>,
   "<code class="key">CV_OCC_M</code>": <code class="string">"0.26119"</code>,
   "<code class="key">LAG1_VOL_M</code>": <code class="string">"-0.07773"</code>,
   "<code class="key">CORR_OCC_1R</code>": <code class="string">"0.21826"</code>,
   "<code class="key">IntervalSeconds</code>": <code class="string">"1200"</code>,
   "<code class="key">MU_VOL_R</code>": <code class="string">"11.625"</code>,
   "<code class="key">CORR_VOLOCC_1M</code>": <code class="string">"0.22346"</code>,
   "<code class="key">VOL_M</code>": <code class="string">"15.00"</code>,
   "<code class="key">LAG1_OCC_M</code>": <code class="string">"-0.00962"</code>,
   "<code class="key">VOL_1</code>": <code class="string">"20.00"</code>,
   "<code class="key">OCC_1</code>": <code class="string">"0.12778"</code>,
   "<code class="key">CV_VOLOCC_1 X CORR_VOLOCC_1M</code>": <code class="string">"0.01314"</code>,
   "<code class="key">Severity--Injury</code>": <code class="string">"0.10563"</code>,
   "<code class="key">OCC_M</code>": <code class="string">"0.11444"</code>,
   "<code class="key">fiveminute</code>": <code class="string">"18:00:00"</code>,
   "<code class="key">MU_VOL_1</code>": <code class="string">"13.20"</code>,
   "<code class="key">CORR_OCC_1M</code>": <code class="string">"0.4936"</code>,
   "<code class="key">CV_VOLOCC_1</code>": <code class="string">"0.05882"</code>,
   "<code class="key">SumVol</code>": <code class="string">"1,537.00"</code>,
   "<code class="key">CORR_VOLOCC_1R</code>": <code class="string">"0.28207"</code>,
   "<code class="key">SD_VOL_R</code>": <code class="string">"3.05243"</code>,
   "<code class="key">OCC_R</code>": <code class="string">"0.14"</code>,
   "<code class="key">CV_VOLOCC_R</code>": <code class="string">"0.21446"</code>,
   "<code class="key">off road accident</code>": <code class="string">"0.0949"</code>,
   "<code class="key">CORR_OCC_1M X MU_VOL_M</code>": <code class="string">"6.71293"</code>,
   "<code class="key">MuVolocc</code>": <code class="string">"352.40435"</code>,
   "<code class="key">LAG1_VOL_R</code>": <code class="string">"-0.14296"</code>,
   "<code class="key">VOL_R</code>": <code class="string">"11.00"</code>,
   "<code class="key">interior lanes accident</code>": <code class="string">"0.14903"</code>,
   "<code class="key">CORR_VOL_MR</code>": <code class="string">"0.41195"</code>,
   "<code class="key">CORR_VOLOCC_MR</code>": <code class="string">"0.21033"</code>,
   "<code class="key">LAG1_VOL_1</code>": <code class="string">"-0.01163"</code>,
   "<code class="key">CORR_OCC_MR</code>": <code class="string">"0.31359"</code>,
   "<code class="key">3+ veh accident</code>": <code class="string">"0.09185"</code>,
   "<code class="key">CORR_OCC_1M X SD_VOL_R</code>": <code class="string">"1.50667"</code>,
   "<code class="key">year</code>": <code class="string">"2008"</code>,
   "<code class="key">CV_VOLOCC_M</code>": <code class="string">"0.06651"</code>,
   "<code class="key">right lane accident</code>": <code class="string">"0.08891"</code>,
   "<code class="key">any accident</code>": <code class="string">"0.38626"</code>,
   "<code class="key">SD_VOL_1</code>": <code class="string">"3.59629"</code>
}</pre>
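<p>Then I have a view with the following map:</p>
<pre>function(doc) {
    // Only index documents that carry a year field.
    if (doc.year) {
        var name = "any accident";
        // Values are stored as strings; subtracting 0 coerces them to numbers.
        emit(doc._id, doc[name] - 0);
    }
}</pre>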
<p>and a reduce that is more or less the same as the Knuthian mean and variance that I wrote up in an earlier post.  My idea was to do bootstrap sampling by just using the POST {&#8220;keys&#8221;: ["key1", "key2", ...]} call documented on the <a href="http://wiki.apache.org/couchdb/HTTP_view_API" target="_blank">HTTP view API page</a>.  But it doesn&#8217;t work, or rather, it works, but the API requires group=true, so each requested key is reduced on its own instead of the whole sample being reduced to a single statistic.  What I get out is something like:</p>
<pre>curl 'http://localhost:5985/safetydb1213891/_design/Any/_view/bytime?group=true' -d '{"keys":["40130190","40130191","40130192","40130193","40130190","40130191","40130192","40130193"]}'
{"rows":[
{"key":"40130190","value":{"M2":0,"n":1,"mean":0.32849,"min":0.32849,"max":0.32849,"variance_n":0}},
{"key":"40130191","value":{"M2":0,"n":1,"mean":0.31275,"min":0.31275,"max":0.31275,"variance_n":0}},
{"key":"40130192","value":{"M2":0,"n":1,"mean":0.31403,"min":0.31403,"max":0.31403,"variance_n":0}},
{"key":"40130193","value":{"M2":0,"n":1,"mean":0.30753,"min":0.30753,"max":0.30753,"variance_n":0}},
{"key":"40130190","value":{"M2":0,"n":1,"mean":0.32849,"min":0.32849,"max":0.32849,"variance_n":0}},
{"key":"40130191","value":{"M2":0,"n":1,"mean":0.31275,"min":0.31275,"max":0.31275,"variance_n":0}},
{"key":"40130192","value":{"M2":0,"n":1,"mean":0.31403,"min":0.31403,"max":0.31403,"variance_n":0}},
{"key":"40130193","value":{"M2":0,"n":1,"mean":0.30753,"min":0.30753,"max":0.30753,"variance_n":0}}
]}</pre>
<p>If I don&#8217;t call it with group=true, I get an error:</p>
<pre>curl 'http://localhost:5985/safetydb1213891/_design/Any/_view/bytime?group=false' -d '{"keys":["40130190","40130191","40130192","40130193","40130190","40130191","40130192","40130193"]}'
{"error":"query_parse_error","reason":"Multi-key fetches for a reduce view must include group=true"}</pre>
<p>So I guess that means if I stick with this approach, I will need to ditch the reduce entirely, and do processing in an external program.</p>
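<p>A minimal sketch of what that external processing could look like, assuming the view is queried without a reduce so that each row carries a doc id as key and a numeric value. The rows reuse the four values from the output above; the function names (meanVariance, resample, bootstrapMeans) are made up for illustration and are not part of CouchDB.</p>

```javascript
// Hypothetical rows, shaped like the output of a map-only view:
// each row has a key (the doc id) and a numeric value.
const rows = [
  { key: "40130190", value: 0.32849 },
  { key: "40130191", value: 0.31275 },
  { key: "40130192", value: 0.31403 },
  { key: "40130193", value: 0.30753 },
];

// One-pass running mean and variance (the Welford/Knuth update).
function meanVariance(values) {
  let n = 0;
  let mean = 0;
  let M2 = 0;
  for (const x of values) {
    n += 1;
    const delta = x - mean;
    mean += delta / n;
    M2 += delta * (x - mean);
  }
  return { n, mean, variance_n: n > 0 ? M2 / n : 0 };
}

// Draw `size` values with replacement; `rand` returns a number in [0, 1).
function resample(values, size, rand = Math.random) {
  const out = [];
  for (let i = 0; i !== size; i += 1) {
    out.push(values[Math.floor(rand() * values.length)]);
  }
  return out;
}

// Compute the mean of each bootstrap replicate, then summarize those means.
function bootstrapMeans(values, replicates, rand = Math.random) {
  const means = [];
  for (let r = 0; r !== replicates; r += 1) {
    means.push(meanVariance(resample(values, values.length, rand)).mean);
  }
  return meanVariance(means);
}

const values = rows.map((row) => row.value);
const summary = bootstrapMeans(values, 1000);
console.log(summary);
```

<p>The summary holds the mean and spread of the replicate means, which is the part the reduce could not give me once group=true was forced.</p>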
<p>I haven&#8217;t yet tried my daily average approach, where a single document contains an entire day.  I don&#8217;t expect that m out of n sampling will work, at least not with a random number generator in there, as there is that requirement in the couchdb docs that a view always produce the same output given the same input.  But a balanced approach should work, as long as the permutation process is &#8220;pseudo-random&#8221; and repeatable for the day.  (Pick any normal number and use that).</p>
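<p>A sketch of one way to get a permutation that is repeatable for a given day. This swaps in a string hash (FNV-1a) and a small linear congruential generator in place of the digits of a normal number, and the date string and observation values are placeholders, so treat it as an assumption about how this could be done rather than something I have run inside a view.</p>

```javascript
// Turn a string such as "2008-02-24" into a 32-bit seed (FNV-1a hash).
function seedFromString(text) {
  let hash = 2166136261;
  for (let i = 0; i !== text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  return hash >>> 0;
}

// Small linear congruential generator; returns numbers in [0, 1).
// The same seed always yields the same sequence.
function makeGenerator(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy of the input, driven by `rand`.
function permute(items, rand) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i -= 1) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Placeholder day and observations; the same day gives the same order.
const day = "2008-02-24";
const observations = [0.32849, 0.31275, 0.31403, 0.30753];
const first = permute(observations, makeGenerator(seedFromString(day)));
const second = permute(observations, makeGenerator(seedFromString(day)));
console.log(first, second);
```

<p>Because the seed depends only on the day, the two permutations printed are identical, which is the property the view engine needs.</p>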
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/04/10/close-still-doesnt-count/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>More thoughts on using bootstrap</title>
		<link>http://contourline.wordpress.com/2009/04/10/more-thoughts-on-using-bootstrap/</link>
		<comments>http://contourline.wordpress.com/2009/04/10/more-thoughts-on-using-bootstrap/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 17:19:09 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=160</guid>
		<description><![CDATA[Closer, but still not yet there using bootstrap sampling in Couchdb.   My prior post was mostly thinking out loud.  I&#8217;ve tried some things since, and this post is an attempt to organize my thoughts on the topic.
The first thing I tried was to submit a list of document ids to a view, and see what [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Closer, but still not yet there using bootstrap sampling in Couchdb.   My prior post was mostly thinking out loud.  I&#8217;ve tried some things since, and this post is an attempt to organize my thoughts on the topic.</p>
<p><span id="more-160"></span>The first thing I tried was to submit a list of document ids to a view, and see what happened.  This might work, and it might not.  It certainly won&#8217;t work as I expected.  That is, I have to do a very flat view: the map has to emit doc._id, value, and the reduce has to compute the statistic of interest over all of the input values.  I haven&#8217;t tried this yet, but my guess is that this will simply rerun the reduce over all of the input values.  So no time is saved by CouchDB&#8217;s caching of views.</p>
<p>Another approach is to put a random sequence in the view and sample from that.  The problem there is that I need to recompute the view every time.  Using external programs, I will have to query the db for the list of docids, sample those with replacement to build my bootstrap sample, then create a view and submit it to the one-off view processor.  Given that the view can&#8217;t be cached anyway, the performance hit for this approach will always be paid, so no big deal not having a cached view.  Still, it would be nice to not have to rewrite the view every time I want to use it just because the database has grown.</p>
<p>Another approach that I am thinking about now is to save my data differently.  Instead of saving the output of each observation&#8217;s computations as its own document, I would collect those results into an entire day&#8217;s worth of data, and stuff the db with that.  Unfortunately, I&#8217;ll also have to rewrite my Java code, as at the moment I am grabbing a few hours of time across all detectors.  Instead I&#8217;ll need to grab a day&#8217;s data for a single detector.</p>
<p>With this method, I *think* I can implement a bootstrap sampling with replacement inside of the javascript.  The rules of the view engine are preserved&#8212;each document is independent of every other, and no reliance on other documents in the database to process a single document.  The map part will sample with replacement from the observations for that day, and then emit as many replicates as I want for the day.</p>
<p>Which brings up another topic I am still unsure about.  Most of the bootstrap references talk about sampling n times from an original sample of size n.  That is, if it is 1,000 observations, each sample has 1,000 observations.  There is some discussion in Chernick&#8217;s book, around p. 178 or so, about using m out of n sampling, that is, drawing a smaller sample than n.  The rule is pretty vague, something about m being on the same order as n, but increasing at a smaller rate than n.  So as n goes to infinity, so does m, but m/n goes to zero.  That is really broad, and I need to get a better source and/or try it out for myself.  Anyway, it seems like log(n) would fit this rule, but would be a terribly small sample.</p>
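<p>To get a feel for how small log(n) really is, here is a quick sketch that compares a few growth rules for m. All three satisfy the condition that m grows without bound while m/n shrinks toward zero; the sqrt(n) and n^(2/3) rules are my own illustrative picks, not recommendations taken from the book.</p>

```javascript
// Candidate rules for the subsample size m as a function of n.
// Only log(n) comes from the discussion above; the others are illustrative.
const candidates = {
  "log(n)": (n) => Math.log(n),
  "sqrt(n)": (n) => Math.sqrt(n),
  "n^(2/3)": (n) => Math.pow(n, 2 / 3),
};

// Round each rule to a whole sample size (at least 1) and report m/n.
function subsampleSizes(n) {
  const out = {};
  for (const [name, rule] of Object.entries(candidates)) {
    const m = Math.max(1, Math.round(rule(n)));
    out[name] = { m, ratio: m / n };
  }
  return out;
}

for (const n of [1000, 1000000]) {
  console.log(n, subsampleSizes(n));
}
```

<p>For n = 1,000 the natural log rule gives a sample of about 7 observations, while the square root rule gives about 32, which is why something between those rules and n itself looks more usable.</p>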
<p>The point of using the m of n sample is to reduce the impact of outliers or a fat tail.  I do have outliers in my data, so it makes sense to use it.  I guess the best solution is to test it versus, say, the balanced sampling approach (concatenate b copies of the n observations, randomly permute them, and cut the result into b samples of size n), and then inspect the differences in the resulting bias and variance for both estimates.</p>
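<p>A rough sketch of balanced resampling as I understand it: every observation is used exactly b times across the b samples, so only the assignment of observations to samples is random. The function names and the five toy values are placeholders for illustration.</p>

```javascript
// Build b balanced bootstrap samples of size n from `values`.
// `rand` returns a number in [0, 1); Math.random is used by default.
function balancedSamples(values, b, rand = Math.random) {
  // Pool holds b copies of every observation.
  const pool = [];
  for (let copy = 0; copy !== b; copy += 1) {
    for (const v of values) pool.push(v);
  }
  // Fisher-Yates shuffle of the pool.
  for (let i = pool.length - 1; i > 0; i -= 1) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  // Cut the shuffled pool into b consecutive samples of size n.
  const n = values.length;
  const samples = [];
  for (let s = 0; s !== b; s += 1) {
    samples.push(pool.slice(s * n, (s + 1) * n));
  }
  return samples;
}

function mean(values) {
  let sum = 0;
  for (const v of values) sum += v;
  return sum / values.length;
}

// Toy data: the average of the replicate means matches the original mean.
const data = [1, 2, 3, 4, 5];
const samples = balancedSamples(data, 4);
const replicateMeans = samples.map(mean);
console.log(replicateMeans, mean(replicateMeans), mean(data));
```

<p>Since each observation appears the same number of times overall, the average of the replicate means equals the original sample mean (up to rounding), which is where the variance reduction for the bias estimate comes from.</p>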
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/04/10/more-thoughts-on-using-bootstrap/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>Bootstrap in a view</title>
		<link>http://contourline.wordpress.com/2009/04/03/bootstrap-in-a-view/</link>
		<comments>http://contourline.wordpress.com/2009/04/03/bootstrap-in-a-view/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 19:35:20 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=156</guid>
		<description><![CDATA[Inspired by this post, I am playing around with implementing bootstrapping various statistics as a view in couchdb.  I am not a statistician, so my definition should not be used as gospel, but bootstrapping is a statistical method where one randomly samples from an observed set of data in order to determine some statistics, such [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Inspired by <a title="MAD skills" href="http://databeta.wordpress.com/2009/03/20/mad-skills/" target="_self">this post</a>, I am playing around with implementing bootstrapping various statistics as a view in couchdb.  I am not a statistician, so my definition should not be used as gospel, but bootstrapping is a statistical method where one randomly samples from an observed set of data in order to determine some statistics, such as the mean or the median.  Most of the older sources I&#8217;ve read talk about using it for small to medium sized data sets, etc., and so the k samples are all of size n.  But I can&#8217;t do that&#8212;my input data is too big.  So I have to pick a smaller n.  So I&#8217;m going with 1,000 for starters, and repeat the draw 10,000 times.</p>
<p>(There&#8217;s probably a secondary bootstrap I can do there to decide on the optimal size of the bootstrap sample, but I&#8217;m not going to dive into that yet.)<span id="more-156"></span></p>
<p>So how to draw 1,000 random samples/documents from a CouchDB database and repeat that process 10,000 times efficiently?    My first inclination was to use the map part of the view, but I suspect that this will fail.  My reading of the docs is that the map must produce a consistent output given the same input. So if the map includes a hard-coded list of samples, great.  But I&#8217;m not so sure the map function is allowed to produce this random list on invocation, because then the map would not be consistent between runs, which would violate all kinds of assumptions I&#8217;m sure.</p>
<p>So the next approach would be to sample all the data, but pull a random subsample with replacement when querying the view.  So in this case, all of the data is used to generate the statistic in question (mean, median, whatever), plus confidence bounds, and then the query does the sampling with replacement.</p>
<p>To do this, my query program (it&#8217;d have to be programmatic) would first ping the db and get the docids, then randomly draw from those with replacement k times.  Each combination of ids would then get sent to the db using the post  {&#8220;keys&#8221;: ["key1", "key2", ...]} api  semantic.</p>
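<p>A sketch of the client side of that idea, leaving out the HTTP calls themselves: given a list of doc ids, draw with replacement and build one {"keys": [...]} body per replicate. The id list and the function names are placeholders; in practice the ids would come from the database first.</p>

```javascript
// Draw `sampleSize` doc ids with replacement.
// `rand` returns a number in [0, 1); Math.random is used by default.
function drawKeys(docIds, sampleSize, rand = Math.random) {
  const keys = [];
  for (let i = 0; i !== sampleSize; i += 1) {
    keys.push(docIds[Math.floor(rand() * docIds.length)]);
  }
  return keys;
}

// Build k request bodies, one per bootstrap replicate.
function buildKeyRequests(docIds, k, sampleSize, rand = Math.random) {
  const bodies = [];
  for (let r = 0; r !== k; r += 1) {
    bodies.push(JSON.stringify({ keys: drawKeys(docIds, sampleSize, rand) }));
  }
  return bodies;
}

// Placeholder ids; each body would be POSTed to the view in turn.
const docIds = ["40130190", "40130191", "40130192", "40130193"];
const bodies = buildKeyRequests(docIds, 3, 8);
for (const body of bodies) console.log(body);
```

<p>Each printed line is one replicate, and duplicate ids show up naturally because the draw is with replacement.</p>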
<p>But that might not work for two reasons.  First, I&#8217;m not so sure that the post will allow duplicate keys, although I guess this is easy enough to test.  Second, this still requires the complete computation of the map/reduce view function over the entire data set.  I already know this takes a long time and uses a lot of space, so I don&#8217;t really want to do that if I don&#8217;t have to.  Still, this approach seems pretty good.</p>
<p>The last option I&#8217;m thinking about is to do the sampling in the reduce function, not in the map.  There was a response to one of my questions on the mailing list that if a  view is byte-identical to another view, then it is only computed once.  So with that in mind, I could write multiple reduce functions paired to identical map functions that subsampled the available docs in different ways.  But I think this will just turn out to be slower in the end than putting the hardcoded sampling scheme into the view function. The reason I say this is because the map is compiled with every application, whereas the view is just compiled once.  So parsing and loading a long hard-coded list of samples to keep (with possible multiple samples) is going to take more time because it is done multiple times.  And I&#8217;d need to either keep 10,000 different reduce functions, all with a unique sampling list, or else generate one monster reduce function that simultaneously computes output on 10,000 possible samples.  I see the input set for the reduce job getting sent down each pipe in the matrix, and either being kept, processed, or processed multiple times on its way out the other end, with the final reduce step producing the final 10,000 statistic(s) and confidence bounds.   I&#8217;m thinking of a bit-mask index matrix like in R, but maybe it would be more efficient to make it a hash map, with an integer value representing the number of times to include the doc (because resampling means there can be multiples and all that).</p>
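<p>The hash map idea can be sketched outside of CouchDB: store, per doc id, how many times the resample picked it, and weight each value by that count. The names drawCounts and weightedMean, and the toy values, are placeholders, and this is plain JavaScript rather than a tested reduce function.</p>

```javascript
// Record how many times each doc id is drawn in one resample.
// `rand` returns a number in [0, 1); Math.random is used by default.
function drawCounts(docIds, sampleSize, rand = Math.random) {
  const counts = new Map();
  for (let i = 0; i !== sampleSize; i += 1) {
    const id = docIds[Math.floor(rand() * docIds.length)];
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return counts;
}

// Mean of the resample, using each doc's count as its weight.
// Docs that were never drawn are simply absent from the map.
function weightedMean(valuesById, counts) {
  let total = 0;
  let sum = 0;
  for (const [id, count] of counts) {
    sum += count * valuesById[id];
    total += count;
  }
  return total > 0 ? sum / total : NaN;
}

// Toy values keyed by placeholder doc ids.
const valuesById = {
  "40130190": 0.32849,
  "40130191": 0.31275,
  "40130192": 0.31403,
  "40130193": 0.30753,
};
const counts = drawCounts(Object.keys(valuesById), 1000);
console.log(counts, weightedMean(valuesById, counts));
```

<p>The map stays small no matter how large the resample is, because its size is bounded by the number of distinct docs rather than by the number of draws.</p>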
<p>The final option that I am *not* going to consider is to use a temporary view.  From the couchdb wiki: &#8220;<strong>Temporary views</strong> are not stored in the database, but rather executed on demand.&#8221;   So at first this sounds pretty good, but I think it is misleading.  I&#8217;d expect that in a proper bootstrap sampling, most of the observations will get sampled at least once.  In that kind of a situation, I think the second option (building a simple statistic map/reduce view and then sampling it on query) is best, because the computations are cached.  Assuming of course that the map part is non trivial, and the reduce part is relatively fast.</p>
<p>Anyway, plenty to work on this weekend&#8230;as always, real tests of these ideas will reveal the truth.</p>
<p>Update:  I just tried some things out, and I am getting clued in a little bit more.  The keys for the post are keys emitted by the map/reduce output, *not* the input documents.  My current map/reduce view uses time of day and risk prediction as the key.  If I instead want random draws over all observations, my output is going to have to be keyed by the document ids.  That is pretty unworkable, I think, unless I have separate views for each time period of interest, each risk prediction, and emit keys that are the doc id.  Again, I&#8217;ll have to try it and see.  But option 1 is moving up as the most likely anyway, especially as I just read  today about balanced resampling to reduce variance in bootstrap methods in Chernick&#8217;s 2008 book, Bootstrap Methods.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/04/03/bootstrap-in-a-view/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
		<item>
		<title>A lot of data is a lot of data</title>
		<link>http://contourline.wordpress.com/2009/03/06/a-lot-of-data-is-a-lot-of-data/</link>
		<comments>http://contourline.wordpress.com/2009/03/06/a-lot-of-data-is-a-lot-of-data/#comments</comments>
		<pubDate>Fri, 06 Mar 2009 18:35:42 +0000</pubDate>
		<dc:creator>jmarca</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[transportation]]></category>

		<guid isPermaLink="false">http://contourline.wordpress.com/?p=150</guid>
		<description><![CDATA[I can&#8217;t seem to get an efficient setup going for storing loop data in couchdb.  On the surface it seems pretty simple&#8212;every loop is independent of every other loop, so every observation can be a document.  But for this application this is more limiting than I first thought.  The problem is that after storing just [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I can&#8217;t seem to get an efficient setup going for storing loop data in couchdb.  On the surface it seems pretty simple&#8212;every loop is independent of every other loop, so every observation can be a document.  But for this application this is more limiting than I first thought.  The problem is that after storing just a few days worth of data, the single couchdb database expands to 35GB.  I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.</p>
<p><span id="more-150"></span>So one db doesn&#8217;t work so well.  There was a posting to the user mailing list asking how many databases people were using.  So I gave that a shot&#8212;trying one db per loop, with the same map/reduce.  The downside is that I can&#8217;t compute averages across loops as before, but that&#8217;s okay because I couldn&#8217;t get the view generation to finish at all.</p>
<p>But I still have problems.  The main couchdb process (beam in top) runs up to 80 or 90 percent CPU usage, and there are lots of javascript child processes split off, but the view computation is still super slow, even on just a few days of data (14 GB uncompressed, including view cache).</p>
<p>I&#8217;m thinking that the only way to avoid this problem is to keep updating the view with every insert into the database.  But I&#8217;m worried that will fall behind real time, let alone allow me to move backwards and process last year&#8217;s data too.  Without any super quantitative measures, it seems from my experience that if I get about a gigabyte behind the curve on computing the cached view, I can&#8217;t keep up&#8212;data loading goes too fast, and index processing never finishes.   Or I get mystery errors like:</p>
<pre>at /usr/lib/perl5/site_perl/5.8.8/i486-linux-thread-multi/Coro.pm line 419
[
  {
    Reason =&gt; "Connection timed out",
    Status =&gt; 599,
    URL =&gt; "http://localhost:5984/safetydb1204650/_view/riskstats2/All",
  },
  undef,
] at ping_couchdbs.pl line 61</pre>
<p>from my program that pings the views for all of the databases.</p>
<p>So maybe for now I need to go back to postgresql.  I do like the map reduce part of Couchdb, and I do like the unstructured doc format, but perhaps it isn&#8217;t so good for massive number crunching yet. But not being able to get a year of data in and a valid annual average out is kind of a show stopper.</p>
<p>But to be fair, I couldn&#8217;t do that in Postgresql either.  My use of Couchdb may help there, as I now have a document centric view of the data, rather than a relational view.  And before I get off the couch (sorry), I still need to look into alternate view servers.  Maybe I can make a view server in C that can run faster than the javascript map/reduce process.   But that will have to wait until next weekend probably while I finish other projects.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://contourline.wordpress.com/2009/03/06/a-lot-of-data-is-a-lot-of-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">jmarca</media:title>
		</media:content>
	</item>
	</channel>
</rss>