Contour Line

January 18, 2012

How big is “too big” for documents in CouchDB: Some biased and totally unscientific test results!

Filed under: couchdb,Uncategorized — jmarca @ 3:56 pm

I have been storing documents somewhat heuristically in CouchDB. Without doing any rigorous tests, and without keeping track of versions and the associated performance enhancements, I have a general rule that tiny documents are too small, and really big documents are too big.

To illustrate the issues, consider a simple detector that collects data every 30 seconds. One approach is to create one document per observation. Over a day, this will create 2880 documents (except for those pesky daylight saving time days, of course). Over a year, this will create over a million documents. If you have just one detector, then this is probably okay, but if you have thousands or millions of them, this is a lot of individual documents to store, and disk size becomes an issue.
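A quick back-of-the-envelope sketch of that arithmetic, in JavaScript. The function name is made up for illustration, and the 23 and 25 hour days are simply the usual lengths of the two clock-change days.

```javascript
// Number of observations (and so one-per-observation documents) produced
// over a span of hours by a detector reporting every `intervalSeconds`.
function docsPerSpan(hours, intervalSeconds) {
  return (hours * 3600) / intervalSeconds;
}

const perDay = docsPerSpan(24, 30);       // 2880
const perShortDay = docsPerSpan(23, 30);  // 2760, clocks go forward
const perLongDay = docsPerSpan(25, 30);   // 3000, clocks go back
const perYear = perDay * 365;             // 1051200

console.log([perDay, perShortDay, perLongDay, perYear].join(' '));
```

Multiply the yearly figure by the number of detectors to see how quickly the document count grows.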
(more…)

January 7, 2012

I want never gets

Filed under: Uncategorized — jmarca @ 6:28 pm

I want to buy an Atlantis from Rivendell. For some reason, I can never pull the trigger.

December 11, 2011

Iterating view and doc designs multiple times

Filed under: couchdb,Uncategorized — jmarca @ 4:31 pm

Just a quick post so that I remember to elaborate on this later.  I have found that whenever I have a large project to do in CouchDB I go through several iterations of designing the documents and the views.

My latest project is typical.

  1. First design was to push in really big documents.  The idea was to run map reduce, copy the reduce output to a second db, and map reduce that for the final result.  But the view generation was too slow, I never got around to designing the second db, and the biggest documents triggered a bug/memory issue.
  2. (more…)

December 8, 2011

Watching views build oh so slowly

Filed under: couchdb — jmarca @ 10:50 pm

I have an application that is taxing my PostgreSQL install, and I’ve been taking a whack at using CouchDB to solve it instead.

On the surface, it looks like a pretty good use case, but I’m having trouble getting it to move fast enough.

In a nutshell, I am storing the output of a multiple imputation process. At the moment my production system uses PostgreSQL for this. I store each imputation output, one record per row. I have about 360 million imputations stored this way.

Each imputation represents an estimate of conditions at a mainline freeway detector. That is done in R using the excellent Amelia package. While the imputation is done for all lanes at the site, because I am storing the data in a relational database with a schema, I decided to store one row per lane. (more…)

December 7, 2011

Replicator database in practice

Filed under: couchdb,Uncategorized — jmarca @ 11:19 pm

The replicator database in couchdb is cool, but one needs to be mindful when using it.

I like it better than sending a message to CouchDB to replicate dbx from machine y to machine z, because I can be confident that even if I happen to restart couch, the replication is going to finish up.

The problem is that for replications that are not continuous, I end up with a bunch of replication entries in the replicator database. Thousands sometimes. Until I get impatient and just delete the whole thing.

For the way I use it, the best solution is to write a view into the db to pick off all of the replications that are not continuous and that have completed successfully, and then do a bulk delete of those documents. But I’m never organized enough to get that done.
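A minimal sketch of what that view and bulk delete might look like. It assumes the `_replication_state` field that CouchDB writes into replicator documents and the standard `_bulk_docs` endpoint; the helper name `bulkDeleteBody` and the sample row are hypothetical, and none of this has been run against a live server.

```javascript
// Map function for a design document in the _replicator database.
// emit() is supplied by CouchDB's view server when the view is built.
function map(doc) {
  if (doc._replication_state === 'completed' && !doc.continuous) {
    emit(doc._id, doc._rev);
  }
}

// Turn the rows returned by that view into a request body for
// POST /_replicator/_bulk_docs that deletes every listed document.
function bulkDeleteBody(viewRows) {
  return {
    docs: viewRows.map(function (row) {
      return { _id: row.id, _rev: row.value, _deleted: true };
    })
  };
}

// Made-up view row, just to show the shape of the payload.
const sampleRows = [{ id: 'rep-2011-12-01', key: 'rep-2011-12-01', value: '3-abc' }];
console.log(JSON.stringify(bulkDeleteBody(sampleRows)));
```

Emitting the `_rev` as the view value means the delete needs no second fetch per document.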

Here’s hoping such a function finds its way into Futon some day.

November 9, 2011

When R and JSON fight

Filed under: couchdb,R — jmarca @ 2:01 pm

I have a love-hate relationship with R. R is extremely powerful and lots of fun when it works, but so often I spend hours at a time wondering what is going on (to put my irritation in printable prose).

Today I finally figured out a nagging problem. I am pulling data from CouchDB into R using the excellent RJSONIO and RCurl libraries. JSON has a strict requirement that unknown values are called null, while R has a more nuanced concept that includes NA as well as NULL. My original usage of the RJSONIO library to save data to CouchDB had to account for this fact, by using a regular expression to convert NA to proper JSON null values. (I think the latest version of RJSONIO might actually handle this better, but I haven’t checked as my current code works fine since the regex is conditional).

Now coming the other way, from CouchDB into R, RJSONIO’s fromJSON() function will happily convert JSON null values into R NULL values. My little couch.get() function looks like this:

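library(RCurl)    # getURL(), getCurlHandle()
library(RJSONIO)  # fromJSON()

## couchdb, localcouchdb and couch.makedbname() are defined elsewhere
couch.get <- function(db,docname, local=TRUE, h=getCurlHandle()){

  if(length(db)>1){
    db <- couch.makedbname(db)
  }
  uri <- paste(couchdb,db,docname,sep="/")
  ## local=TRUE points the request at the local couch instead
  if(local) uri <- paste(localcouchdb,db,docname,sep="/")
  ## hack to url encode spaces
  uri <- gsub("\\s","%20",x=uri,perl=TRUE)
  ## fetch the document and parse the JSON body
  fromJSON(getURL(uri,curl=h)[[1]])

}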

The key line is the last one, where the results of RCurl’s getURL() function are passed directly to RJSONIO’s fromJSON() and then returned to the caller.

In my database, to save space, each document is a list of lists for a day.

{
   "_id": "1213686 2007-02-28 10:38:30",
   "_rev": "1-c8f0463d1910cf4e89370ece6ef250e2",
   "data": {
       "nl1": [9,12,12, ... ],
       "nr1": [ ... ],
       ...
       "ts" : [ ... ]
   }
}

Every entry in the ts list has a corresponding entry in every other array in the data object, but that entry could be null. This makes it easy to plot the data against time (using d3, but that is another post) or reload back into R with a timestamp.
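On the JavaScript side, a sketch of how that column layout can be turned into one row per timestamp for plotting. The function name `toRows` and the sample values are made up for illustration; the point is only that a null stays in place so every row lines up with its timestamp.

```javascript
// Turn the columnar "data" object into one row per timestamp.
// A missing observation stays null rather than being dropped.
function toRows(data) {
  const names = Object.keys(data).filter(function (k) { return k !== 'ts'; });
  return data.ts.map(function (ts, i) {
    const row = { ts: ts };
    names.forEach(function (name) {
      const v = data[name][i];
      row[name] = (v === undefined) ? null : v;
    });
    return row;
  });
}

// Made-up sample values, not taken from the real database.
const sample = { nl1: [9, null, 12], nr1: [4, 5, null], ts: [0, 30, 60] };
const rows = toRows(sample);
console.log(JSON.stringify(rows));
```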

But loading data into R isn’t quite the one-liner I was expecting, because of how R handles NULL compared to NA. My first and incorrect attempt was:

alldata <- doc$data
colnames <- names(alldata)
## deal with non ts first
varnames <-  grep( pattern="^ts$",x=colnames,perl=TRUE,invert=TRUE,value=TRUE )
## keep only what I am interested in
varnames <-  grep( pattern="^[no][lr]\\d+$",x=varnames,invert=TRUE,perl=TRUE,value=TRUE )
## unlist() drops NULL entries, which is what goes wrong here
data.matrix <- matrix(unlist(alldata[varnames]),byrow=FALSE,ncol=length(varnames))

First I grab just the data object, pull off the variables of interest, then make a matrix out of the data.

The problem is the recursive application of unlist buried in the matrix command. The alldata object is really a list of lists, and some of those lists have NULL values, so recursive application of unlist SILENTLY wipes out the NULL values. (So IRRITATING!)

Instead what you have to do is carefully replace all numeric NULL values with what R wants: NA. (And this is where learning how to do all that callback programming in javascript comes in handy, as I define a callback function for the lapply method inline and don’t get worked up about it anymore.)

  ## first, make NULL into NA
  intermediate <- lapply(alldata[varnames],function(l){
    nullmask <- unlist(lapply(l, is.null))
    l[nullmask] <- NA
    l
  })
  ## then do the unlisting
  data.matrix <- matrix(unlist(intermediate),byrow=FALSE,ncol=length(varnames))

Most of the time the simple way worked fine, but it required special handling when I slapped the timeseries column back onto my data. What I ended up having to do (when I was just hacking code that worked (TM)) was to drop timestamps for which all of the rows of data I was interested in were all NULL. And yes, the logic was as tortured as the syntax of that sentence.

But every once in a while the data would be out of sync, because sometimes there would be different numbers of NULL values in the variables I was extracting (for example, the mean would be fine, but one of the correlation coefficients would be undefined). In those cases the loop would either work and be wrong (if the odd number of NULL values was perfectly aliased with the length of varnames), or else it would crash and get noted by my error handler.

With the new explicit loop to convert NULL to NA, the loading function works fine, with no more try-errors returned from my try call. And even better, I no longer have to lie awake nights wondering whether some data was just perfectly aliased with missing values so that it slipped through.

November 4, 2011

Slacking on the Couch

Filed under: couchdb,slackware — jmarca @ 2:18 pm

I run Slackware. I also use CouchDB. Seems like a natural fit, but the slackbuild on SlackBuilds.org is stuck at 0.11.

That’s okay, it is a good script and works well with the latest version. However, I don’t want to run the latest release of CouchDB, I want to run 1.2.x from the git repository, because I really like the new replication engine for my work.

So, I had to do some tinkering with the SlackBuild script. (more…)

October 27, 2011

super useful page for html escape codes

Filed under: Uncategorized — jmarca @ 12:06 pm

CouchDB wants its fancy startkey and endkey values properly escaped. So that means I have to look up '[' and ']' and so on for their hexadecimal equivalents. I usually turn to this super useful page, even though it is way down on the search results. The others look like spam websites.

So, tune your linkages to http://web.cs.mun.ca/~michael/c/ascii-table.html

Update: or as that anonymous comment says below, man 7 ascii
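In JavaScript the lookup can be skipped altogether, since the built-in encodeURIComponent() does the escaping. A small sketch follows; the helper name `viewQuery` and the sample keys are made up for illustration.

```javascript
// Build a view query string with JSON-encoded, percent-escaped keys.
function viewQuery(startkey, endkey) {
  return 'startkey=' + encodeURIComponent(JSON.stringify(startkey)) +
         '&endkey=' + encodeURIComponent(JSON.stringify(endkey));
}

// '[' is 0x5B and ']' is 0x5D in the ASCII table.
const brackets = [encodeURIComponent('['), encodeURIComponent(']')].join(' ');
console.log(brackets); // %5B %5D

// Hypothetical array keys, only to show the escaping.
const query = viewQuery(['1213686', '2007-02-28'], ['1213686', {}]);
console.log(query);
```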

keys() is an Object method

Filed under: Uncategorized — jmarca @ 9:24 am

For a little while I didn’t really *get* the difference between underscore and async in node.js.

Yesterday I wrote up some code to copy some data out of PostgreSQL and into CouchDB. At some point, I have a big object whose keys are the document IDs in my CouchDB database, and I needed to fire off a request to update each document in turn. Because I’m lazy, I usually use _.keys(object) to get an object’s keys so my server-side and client-side javascript follow the same conventions. To apply a function to the object key value pairs, I would normally use _.each(object,function(value,key){...}), but in this case, where I want a little more control over how many simultaneous GETs then PUTs I fire off at CouchDB, underscore’s each is a little awkward to use.

In the past I’ve hacked up self-made limiters, but as I use async more I’ve been learning about useful ways to combine its functions. In this case, I made an async.whilst loop that splices out 30 or so ids, then uses async.forEach() to fire off request operations for each of these document ids. The request operations themselves are nested—I usually try to use pipe whenever I can (pipe the get into the put), but I haven’t yet tested what happens when you modify the document in between the get and the put.

In short, my current approach is that when I want simple iterators I use underscore, but if there is a whiff of blocking in the call, I will use async instead. As I follow this convention, it begins to get more and more useful. In underscore, the function just runs. If it is something like a request call that will return right away and go do something asynchronously, then I have to program my own solution to figuring out when that call is done. In contrast, async makes liberal use of callbacks. async.forEach() will also fire off lots of simultaneous request objects, but it passes its own callback function to each one of them, and I can call that callback in the final callback of my request invocation. Very handy. And then async.forEach() has an optional third argument that is a function to execute when all of the parallel forEach calls are done. Again, very handy, and much cleaner than hacking up my own solution.
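To make the batching idea concrete without pulling in any library, here is a plain JavaScript stand-in for the whilst-plus-forEach combination. It is not the async library's code, and the name `processInBatches`, the worker, and the ids are all hypothetical. It splices off a batch of ids, hands each one a callback, and only moves on once every callback in the batch has fired.

```javascript
// Process ids in batches: take up to batchSize ids, start worker(id, cb)
// for each, and start the next batch only after all of them call back.
function processInBatches(ids, batchSize, worker, done) {
  const queue = ids.slice(); // copy so the caller's array is untouched
  function nextBatch() {
    if (queue.length === 0) return done(null);
    const batch = queue.splice(0, batchSize);
    let pending = batch.length;
    let failed = false;
    batch.forEach(function (id) {
      worker(id, function (err) {
        if (failed) return;              // an earlier error already ended the run
        if (err) { failed = true; return done(err); }
        pending -= 1;
        if (pending === 0) nextBatch();  // whole batch finished
      });
    });
  }
  nextBatch();
}

const seen = [];
let finished = false;
processInBatches(['doc1', 'doc2', 'doc3', 'doc4', 'doc5'], 2, function (id, cb) {
  seen.push(id); // a real worker would GET then PUT the document here
  cb(null);
}, function (err) {
  finished = !err;
});
console.log(finished, seen.join(','));
```

The batch size is the knob that limits how many simultaneous requests hit the database.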

Which brings me back to my title. Because I’m not using underscore in this case, I suddenly didn’t want to use _.keys(object) to get the list of keys. Naively I tried object.keys(), but that is an error. The proper semantics is Object.keys(object), and I learned something super basic at the same time that I am settling far more complicated usage patterns.
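For the record, a tiny sketch of the difference, using a made-up object:

```javascript
// Hypothetical object keyed by document id.
const docs = { 'doc-a': { n: 1 }, 'doc-b': { n: 2 } };

// keys() is a static method on Object and takes the object as its argument.
const ids = Object.keys(docs);
console.log(ids.join(','));

// Plain objects have no keys() method of their own, so this throws.
let errorName = null;
try {
  docs.keys();
} catch (e) {
  errorName = e.name;
}
console.log(errorName);
```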

September 10, 2011

Overcoming shy programmer syndrome

Filed under: Uncategorized — jmarca @ 10:33 pm

I write a lot of programs, but I never publish them for others to use. Now with git and github, there aren’t any more real excuses.

Because I have been documenting like mad and cleaning up code, I am also taking the opportunity to push up working packages to github. So far I’ve pushed up two node.js utilities I am using. One is called makedir, and the other is called cas_validate.
(more…)
