Dump a doc from CouchDB with attachments

In order to dummy up a test in node.js, I need data to populate a testing CouchDB database. Specifically, I am testing some code that creates statistics plots (in R) and then saves them to a doc as attachments. So for my tests, I need at least one document with its PNG attachments already in place.

I couldn’t find a simple “howto” for this on the Internet, so here’s a note to my future self.

First of all, the CouchDB docs are great, and curl is your friend. Curl lets you set the headers. In this case, I don’t want HTML to come back; I want a valid JSON document. So (in typical belt-and-suspenders style) I set both the Content-Type and Accept headers to application/json as follows:

curl -H 'Content-Type: application/json' \
-H 'Accept: application/json' \
'127.0.0.1:5984/my%2freal%2fdatabase/801447?attachments=true' > 801447.json

The returned document has the binary PNG files base64-encoded into JSON fields, in accordance with the CouchDB specs:

{"_id":"801447","_rev":"55-8e15623f21dce9ed556cfe96b9c85a8e",
"2012":{"properties":[
  {"name":"SERFAS CLUB",
   "cal_pm":"R3.688",
   "abs_pm":40.920000000000001705,
   "latitude_4269":"33.880712",
   "longitude_4269":"-117.613596",
   "lanes":1,
   "segment_length":"0.316",
   "freeway":91,
   "direction":"E",
   "vdstype":"ML",
   "district":8,
   "versions":["2012-12-04","2012-12-12"],
     "geojson":{"type":"Point",
                "crs":{"type":"name",
                       "properties":{"name":"EPSG:4326"}},
                "coordinates":[-117.62000000000000455,
                                 33.881000000000000227]}
               }
   ]},
"_attachments":{
 "801447_2012_raw_004.png":{
  "content_type":"image/png",
  "revpos":53,"digest":"md5-tF2vnhvNw7pLHlK31DVNUw==",
  "data":"iVBORw0KGgoAAAANSUhEUgAABkAAAAGQCAIAAAB59ztRAAAgAElEQVR4
nOzdeYAUxd038Opr7tmbXVYRXEQEOeSSy3greK1sIJqIRImaRONLPBJDTFBRDGp4Dl
GjicYj4oEJyimsyHItyiWPoCAYjQQQuZZll71m5+r3jwrtOEd1z0xP9TDz/fzDzNBb
v6ruquqemupqQVVVAgAAAAAAAAAAkK1EqzMAAAAAAAAAAADAggEsAAAAAAAAAADIah
jAAgAAAAAAAACArIYBLAAAAAAAAAAAyGoYwAIAAAAAAAAAgKyGASwAAAAAAAAAAMhq
GMACAAAAAAAAAICshgEsAAAAAAAAAADIahjAAgAAAAAAAACArIYBLAAAAAAAAAAAyG
oYwAIAAAAAAAAAgKyGASwAAAAAAAAAAMhqGMACAAAAAAAAAICshgEsAAAAAAAAAADI
ahjAAgAAAAAAAACArIYBLAAAAAAAAAAAyGoYwAIAAAAAAAAAgKyGASwAAAAAAAAAAM
hqGMACAAAAAAAAAICshgEsAAA ..."

Lovely binary-to-base64, looking good.
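As a sanity check, a few lines of node can decode one of those attachments back into a PNG (a quick sketch; the file and attachment names match the dump above):

// decode_attachment.js -- assumes the dump above was saved as 801447.json
var fs = require('fs')

var doc = JSON.parse(fs.readFileSync('801447.json', 'utf8'))
var att = doc._attachments['801447_2012_raw_004.png']

// with ?attachments=true, CouchDB inlines each attachment as base64
fs.writeFileSync('check.png', new Buffer(att.data, 'base64'))
console.log('wrote check.png (' + att.content_type + ')')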

To verify that the returned document is actually valid JSON, I use the command line some more (I’m not sure which Linux package installed json_verify, but there are several JSON pretty printers and verifiers out there):

james@emma files[bug/fixplots]$ json_verify < 801447.json

JSON is valid
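And if json_verify isn’t installed, node itself works as a verifier, since JSON.parse throws on malformed input:

node -e "JSON.parse(require('fs').readFileSync('801447.json','utf8')); console.log('JSON is valid')"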

Then to use the document in my test, all I have to do is read it in and send it off:

// superagent does the HTTP; should provides the assertions
var superagent = require('superagent')
var should = require('should')

function put_json_file(file,couchurl,cb){
    var db_dump = require(file) // in node you can require JSON too!
    superagent.post(couchurl)
    .type('json')
    .send(db_dump)
    .end(function(e,r){
        should.not.exist(e)
        should.exist(r)
        return cb(e)
    })
    return null
}

To see that in action, I put my various CouchDB-related utilities in a file here, and then my actual test has a before job that creates the CouchDB database and populates it, and a corresponding after task that deletes the temporary database.
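For reference, the before/after pair looks roughly like this (a sketch: the utilities file and the create_db/delete_db helpers are hypothetical stand-ins for my own):

var should = require('should')
var utils = require('./couchdb_utils') // hypothetical utilities file

var couchurl = 'http://127.0.0.1:5984/test_plots_' + Date.now()

before(function(done){
    // create the temporary database, then load the test doc
    utils.create_db(couchurl, function(err){
        should.not.exist(err)
        utils.put_json_file('./files/801447.json', couchurl, done)
    })
})

after(function(done){
    // blow away the temporary database
    utils.delete_db(couchurl, done)
})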

CouchDB 2.0 preview day 2

Yesterday I fired up CouchDB 2.0 (well, the latest git master). Today I wanted to start using it, and right away ran into a difference between the old way and the new way.

My test application needs CORS to be enabled. The old way, one could fiddle with the config files directly, fiddle with them in Futon, or use the handy command-line tool from the PouchDB project at https://github.com/pouchdb/add-cors-to-couchdb.

But CouchDB 2.0 by default spawns three nodes, not just one. Therefore Fauxton prevents the root user from manipulating the configuration of CouchDB directly, and instead suggests that this task be performed with “configuration management tools like Chef, Ansible, Puppet or Salt (in no particular order).”

Configuration via configuration management tool

Because the 2.0 release isn’t really done yet, there isn’t much support available in the documentation. I couldn’t find any mention of how to use “Chef, Ansible, Puppet, or Salt,” and since I’ve never used any of them before, I’m not going to learn one just for such a simple task.

Instead, I decided to go the manual route, and try to fiddle directly with the config files for each node. In my couchdb directory, I am running the server from the ./dev/ subdirectory. Looking there, I found the following directory tree:

james@emma couchdb[master]$ tree -d dev
dev
├── data
├── lib
│   ├── node1
│   │   ├── data
│   │   │   └── shards
│   │   │       ├── 00000000-1fffffff
│   │   │       ├── 20000000-3fffffff
│   │   │       ├── 40000000-5fffffff
│   │   │       ├── 60000000-7fffffff
│   │   │       ├── 80000000-9fffffff
│   │   │       ├── a0000000-bfffffff
│   │   │       ├── c0000000-dfffffff
│   │   │       └── e0000000-ffffffff
│   │   └── etc
│   ├── node2
│   │   ├── data
│   │   │   └── shards
│   │   │       ├── 00000000-1fffffff
│   │   │       ├── 20000000-3fffffff
│   │   │       ├── 40000000-5fffffff
│   │   │       ├── 60000000-7fffffff
│   │   │       ├── 80000000-9fffffff
│   │   │       ├── a0000000-bfffffff
│   │   │       ├── c0000000-dfffffff
│   │   │       └── e0000000-ffffffff
│   │   └── etc
│   └── node3
│       ├── data
│       │   └── shards
│       │       ├── 00000000-1fffffff
│       │       ├── 20000000-3fffffff
│       │       ├── 40000000-5fffffff
│       │       ├── 60000000-7fffffff
│       │       ├── 80000000-9fffffff
│       │       ├── a0000000-bfffffff
│       │       ├── c0000000-dfffffff
│       │       └── e0000000-ffffffff
│       └── etc
└── logs

39 directories

Clearly, there are three nodes, and each has an etc subdirectory. And find turns up what I’m looking for right where I think it should be:

james@emma couchdb[master]$ find dev -name local.ini
dev/lib/node1/etc/local.ini
dev/lib/node2/etc/local.ini
dev/lib/node3/etc/local.ini

So I loaded each local.ini in turn into emacs and turned on CORS in each:

[httpd]
;port = 5984
;bind_address = 127.0.0.1
enable_cors = true
...
[cors]
credentials = false
; List of origins separated by a comma, * means accept all
; Origins must include the scheme: http://example.com
; You can’t set origins: * and credentials = true at the same time.
origins = *
...

You can’t just copy node1’s local.ini to all three nodes, because each file contains the node’s UUID. Duplicate (or triplicate!) UUIDs are a little stupid…even I know that.
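For reference, the line in question lives in the [couchdb] section of each node’s local.ini, something like this (the value here is invented):

[couchdb]
; written by the node on first startup; must be unique per node
uuid = 2a1f04a73eec8d4c2b90573d94a4b1b0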

I restarted the three nodes using dev/run, and then for good measure I downloaded haproxy from SlackBuilds, built it, installed it, then ran

/usr/sbin/haproxy -f rel/haproxy.cfg
[WARNING] 279/120801 (18768) : config : log format ignored for frontend 'http-in' since it has no log address.
[WARNING] 279/120801 (18768) : Health check for server couchdbs/couchdb1 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
[WARNING] 279/120803 (18768) : Health check for server couchdbs/couchdb2 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
[WARNING] 279/120805 (18768) : Health check for server couchdbs/couchdb3 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.

I had to switch the port from 5984 to 5985 in rel/haproxy.cfg because I’m currently running 1.6.x CouchDB on 5984, but the proxy worked. I was also able to ping the proxy from a different machine, because it listens to *, not 127.0.0.1.
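The only edit was in the frontend section; the relevant lines of rel/haproxy.cfg ended up looking roughly like this (abridged, with the dev cluster’s default node ports):

frontend http-in
    # moved off 5984, which my 1.6.x CouchDB still occupies
    bind *:5985
    default_backend couchdbs

backend couchdbs
    server couchdb1 127.0.0.1:15984 check
    server couchdb2 127.0.0.1:25984 check
    server couchdb3 127.0.0.1:35984 check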

james@emma couchdb[master]$ ssh 192.168.0.1 
Last login: Wed Oct  7 12:14:03 2015 from 192.168.0.9
james@kitty ~$ curl 192.168.0.9:5985
{"couchdb":"Welcome","version":"4ca9e41","vendor":{"name":"The Apache Software Foundation"}}

I haven’t actually tested whether or not I’ve set up CORS properly. That’s for my next post I guess.

Upgrade CouchDB to 2.0.0 preview/master branch

I was inspired today to try couchdb master, which is more or less the
2.0 preview. I ran into a minor problem that didn’t seem to be
documented anywhere.

I have a repo that I’ve been using to track the 1.6.x patches, so I
just did a pull there, checked out master, and tried to configure.

git pull
git checkout master
./configure

The configure process started to download a lot of stuff using git,
then crashed with a mysterious complaint about an app dir and an app
file missing.

Is it Erlang?

Since I’m on Slackware, I tend to compile everything that isn’t
standard Slackware, and the standard SlackBuild for Erlang these days
is 17.4. I know I’ve had trouble with that in the past, so I took a
look at the INSTALL file and then the git logs, and saw that the
maximum Erlang version mentioned is 17.0. So I downloaded 17.0,
compiled it, and replaced 17.4 with 17.0.

Same problem. ./configure ran much faster, but failed with the same
error.

Is it just me?

I started to get discouraged, feeling like perhaps CouchDB wasn’t
going to let me relax any more. Because the error was in the
sub-projects, I poked around the configure and Makefile files and
didn’t see a way to force a clean checkout. So I just deleted the
problem directory (./src/couch_index) and ran configure again.
Again it crashed, but this time on a different file.

Because I trust git and because it isn’t my project, I just deleted
all of the directories under ./src/ and did a git status. Git
said that all was okay, so clearly none of the stuff under ./src was
under version control.

Rerunning ./configure this time checked out all of the projects, and
completed successfully.

Sadly, at the end of the configure step, I read the words:

Updating lager from {git,"https://git-wip-us.apache.org/repos/asf/couchdb-lager.git",
{branch,"master"}}
Updating bear from {git,"https://git-wip-us.apache.org/repos/asf/couchdb-bear.git",
{tag,"0.8.1"}}
james@kitty couchdb[master]$

Gone is the admonition

You have configured Apache CouchDB, time to relax.

Build

I went ahead and restored Erlang to 17.4, re-ran the configure step,
then ran make. Everything ran smoothly, aside from a minor hiccup
requiring me to run sudo pip install sphinx then make again.

Run

I didn’t want to install the new CouchDB, but rather just wanted to
play with it. Reading from https://couchdb.apache.org/developer-preview/2.0/,
I executed dev/run from the command line after the make completed
successfully. After it fired up the three nodes of the CouchDB 2.0
service (yay, 3 nodes out of the box!), I noted the root user and
password, and hopped over to http://127.0.0.1:15984/_utils in my
browser. The new Fauxton popped up, I logged in with the root
username and password, and poked around the empty CouchDB.

Of course, not much there, but so it goes.

I haven’t had the guts to try cloning any of my old databases from
CouchDB 1.6.x (that’s for some other day). Instead I satisfied myself
with making a new, non-root user.

Unlike the old version of Futon, there isn’t an obvious place in
Fauxton to add a new user. I also found that the 2.0 docs aren’t
super complete, so I was curious whether the old, curl-based method
of adding users (documented here) would work.

I ran the following command:

curl -X PUT http://localhost:15984/_users/org.couchdb.user:james \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-d '{"name": "james", "password": "pluot caramelized muffin breakfast", "roles": [], "type": "user"}'

Curl reported success, and I poked the _users database in Fauxton
and saw my new user, with the password properly hashed, of course.
Now I can log in as “james” rather than “root”.

So the upgrade to 2.0 developer preview is a success. Next I have to
actually test out all the new features.

From simple examples to complicated real world cases

I have a really irritating use-case for a CouchDB view. I have several hundred million documents representing hourly data for 4km grid cells in California, and I need to group them by areas. For example, grid cell i=100, j=223 is in Mendocino County, and in the “NORTH COAST” air basin. Of course I have the geometry of the grid cells and the geometry of the counties, air basins, and so on, in PostgreSQL/PostGIS, and I usually just shoot off a query to get the relationship and I’m done. This is CouchDB, however, and views cannot rely on external information lest they become idemimpotent (I made that up). Everything that the view needs must be in the view from the start.

Fair enough, I set up the SQL queries and generated my 9,800+ row JavaScript hash lookup table that maps grid cell to various areas of interest. Now I want to mix that into the view without pulling my hair out.

There is a really simple example in the CouchDB wiki. I’ve reproduced it below:

 {
   _id:"_design/test",
   language: "javascript",
   whatever : {
     stringzone : "exports.string = 'plankton';",
     commonjs : {
       whynot : "exports.test = require('../stringzone')",
       upper : "exports.testing = require('./whynot').test.string.toUpperCase()"
     }
   },
   shows: {
     simple: "function() {return 'ok'};",
     requirey : "function() { var lib = require('whatever/commonjs/upper'); return lib.testing; };"
   },
   views: {
     lib: { 
       foo: "exports.bar = 42;" 
     },
     test: { 
       map: "function(doc) { emit(doc._id, require('views/lib/foo').bar); }"
     }
   }
  }

So where the above example says foo: "exports.bar = 42;", I want to add in my massive hashtable. Obviously cutting and pasting so many lines is not the way to go. Instead, I’m using a couchapp tool.

The concept of a couchapp used to get more press than it currently seems to, but the basic idea is to use code to load up your design doc with attachments and views. In my case, I couldn’t care less about the attachments and the notion of a webapp stored and served by CouchDB. I just want to programmatically construct the view document and push it to CouchDB. I chose to use node.couchapp.js. I could also have "rolled my own", and in fact I probably will this afternoon. I am playing around with grunt, so I used grunt_couchapp (after patching it a bit to use cookie-based authentication).

The basic structure of my directory is the following:


config.json
package.json
Gruntfile.js
app.js
lib
├── cellmembership.json
└── dump_membership.js
node_modules
├── ...
└── ...

The config.json file contains my database details, including my username and password. package.json contains the npm dependencies, mostly what was pulled in by the grunt_couchapp tool, and the node_modules directory holds all the node modules. I do not have an _attachments directory, so I make sure my design doc has no attachments!
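For what it’s worth, config.json is just a small JSON blob along these lines (the keys and values here are invented for illustration; the real shape is whatever grunt_couchapp expects):

{
    "couch": "http://127.0.0.1:5984/calvad_grids",
    "auth": { "username": "james", "password": "..." }
}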

Before getting to app.js, in which the design document is defined, I will first talk about what goes into it. The lookup table is stored as a JSON object in lib/cellmembership.json. The contents look like:

{ "100_223":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
 "100_224":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
   ... 9,890 more lines like this ...
 "304_48":{"airbasin":"SALTON SEA","bas":"SS","county":"IMPERIAL","fips":"13","airdistrict":"IMPERIAL COUNTY APCD","dis":"IMP"},
 "98_247":{"airbasin":"NORTH COAST","bas":"NC","county":"HUMBOLDT","fips":"12","airdistrict":"NORTH COAST UNIFIED AQMD","dis":"NCU"}
}

The view code that uses this file is saved to lib/dump_membership.js, and looks like:

// map function: group each doc’s value by the county its grid cell is in
module.exports = function(doc){
    var lookup = require('views/lib/cellmembership').lookup
    emit(lookup[doc.cell_id].county, doc.value)
}

These two pieces are put together in app.js, which looks like this:

var couchapp = require('couchapp')
var cellmembership = require('./lib/cellmembership.json')
var mapfun = require('./lib/dump_membership')

var ddoc = {
    _id: '_design/calvad',
    rewrites: [{
      from: '',
      to: 'index.html',
      method: 'GET',
      query: {}
    },{
      from: '/*',
      to: '/*'
    }],
    views: {
        "lib":{
            "cellmembership":"exports.lookup="+JSON.stringify(cellmembership)
        },
        "test":{
            "map":mapfun
        }
    },
    lists: {},
    shows: {}
};


module.exports = ddoc;

So instead of "exports.bar=42;", I put in "exports.lookup="+JSON.stringify(...). The key insight that the simple example didn’t really convey is that you want your entire "library" module to be a string. So in this case that means saving my JSON lookup document as a string using JSON.stringify. I probably could have just loaded it directly using fs.readFile(), but I like this way, because it soothes my worries about malformed JSON. If the JSON is screwed up, app.js won’t run, and the failure happens right away, not in the midst of cranking through hundreds of millions of documents.

The other bit that I didn’t get from the example was how to include an external function in the design document. What I did was pretty simple, and it worked: I just did "map":mapfun. This is exactly the opposite of what needed to be done with the views:lib:cellmembership construct. There the exports.lookup= statement needs to be a string inside the JavaScript, whereas the assignment of the map function needs to be actual JavaScript code, not the string representation of that code.

This is exactly the kind of inconsistency that drives me nuts and that nobody ever thinks to document, because only crazies like me run into those edge cases.

Take that, cryptic error message

Sometimes a program that has worked fine for weeks and weeks still has bugs that crop up for no apparent reason. Yesterday I ran into that sort of irritating situation, but I learned some stuff, so I’m writing this up so that there is one more possible solution paired with a cryptic error message for the search engines to suck up.

The situation

I am running a geospatial modeling job to estimate variables in time and space. There are a lot of little grids to process, and each needs a model run for each hour.

How I fire off multiple R jobs from node.js

Node.js has become my hammer of choice for most systems programming type jobs. In an earlier post I talked about how to use CouchDB to store the progress and state of jobs that need doing. Here I will demonstrate how I trigger those jobs and update CouchDB using a fairly simple node.js program.

Two key features of node that make this program possible are spawn and the ability to read and manipulate environment variables.

var spawn = require('child_process').spawn
var env = process.env

Node.js is fully capable of using child processes. One can choose from exec, execFile, spawn, and fork. For my usage, the spawn function does exactly what I want—it creates a child process that reports back when it exits.

The other useful tool is the ability to access the current running environment using the process.env variable. This allows my program to take note of any environment variables that are already set, and to fill in any missing variables that my child process might need.
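Putting the two together looks something like this (the R script name and the extra variable are invented for illustration):

var spawn = require('child_process').spawn

// shallow-copy the current environment, then fill in anything missing
var env = {}
Object.keys(process.env).forEach(function(k){ env[k] = process.env[k] })
env.R_LIBS_USER = env.R_LIBS_USER || '/home/james/Rlibs' // hypothetical default

var job = spawn('Rscript', ['model_run.R'], { env: env })
job.on('exit', function(code){
    console.log('R job exited with code', code)
})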

Concurrency via queue

Using spawn one can fire off as many jobs as desired. Suppose you have a machine with four cores, then calling spawn four times will efficiently use your processing power. Unfortunately it isn’t usually that simple. Instead, what typically happens is that you have a lot of separable data crunching tasks that need to be run, and you want to have four data processing jobs running at all times until the work is all done. To accomplish this, the spawn function will need to be called four times (to fill up the processors) and then will need to spawn a new job whenever one of the existing jobs finishes.
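A minimal sketch of that queue pattern (the task list and the R script are invented for illustration):

var spawn = require('child_process').spawn

var MAX_JOBS = 4 // one job per core
var tasks = ['06001', '06037', '06059'] // hypothetical per-county task ids
var running = 0

function fill_queue(){
    // keep MAX_JOBS children alive until the task list is drained
    while(running < MAX_JOBS && tasks.length > 0){
        var county = tasks.shift()
        running += 1
        spawn('Rscript', ['model_run.R', county])
            .on('exit', function(){
                running -= 1
                fill_queue() // a core freed up; start the next task
            })
    }
}
fill_queue()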


Using CouchDB to store state: My hack to manage multi-machine data processing

This article describes how I use CouchDB to manage multiple computing jobs. I make no claims that this is the best way to do things. Rather I want to show how using CouchDB in a small way gradually led to a solution that I could not have come up with using a traditional relational database.

The fundamental problem is that I don’t know what I am doing when it comes to managing a cluster of available computers. As a researcher I often run into big problems that require lots of data crunching. I have access to about six computers at any given time: two older, low-powered servers, two better servers, and two workstations, one at work and one at home. If one computer can’t handle a task, it usually means I have to spread the pain around on as many idle CPUs as I can. Of course I’ve heard of cloud computing solutions from Amazon, Joyent, and others, but quite frankly I’ve never had the time and the budget to try out these services for myself.

At the same time, although I can install and manage Gentoo on my machines, I’m not really a sysadmin, and I really can’t wire up a proper distributed heterogeneous computing environment using cool technologies like Ømq. What I’ve always done is human-in-the-loop parallel processing. My problems have some natural parallelism—for example, the data might be split across the 58 counties of California. This means that I can manually run one job per county on each available CPU core.

This human-in-the-loop distributed computing model has its limits, however. Sometimes it is difficult to get every available machine to have the same computational environment. Other times it just gets to be a pain to have to manually check on all the jobs and keep track of which are done and which still need doing. And when a job crashes halfway through, my manual method sucks pretty hard, as it usually means restarting that job from the beginning.


CouchDB and Erlang

Typical left-field introduction

As far as I understand it, the ability to run Erlang views natively is likely to be removed in the future, because Erlang views offer no sandboxing and so can execute arbitrary commands on the server.

Problem: big docs crash JSON.parse()

That said, I have a use case for Erlang views.

How big is “too big” for documents in CouchDB: Some biased and totally unscientific test results!

I have been storing documents somewhat heuristically in CouchDB. Without doing any rigorous tests, and without keeping track of versions and the associated performance enhancements, I have a general rule that tiny documents are too small, and really big documents are too big.

To illustrate the issues, consider a simple detector that collects data every 30 seconds. One approach is to create one document per observation. Over a day, this will create 2880 documents (except for those pesky daylight saving time days, of course). Over a year, this will create over a million documents. If you have just one detector, then this is probably okay, but if you have thousands or millions of them, this is a lot of individual documents to store, and disk size becomes an issue.
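The arithmetic behind those numbers:

// one doc per 30-second observation
var per_day = 24 * 60 * 60 / 30   // 2880 docs per detector per day
var per_year = per_day * 365      // 1,051,200 docs per detector per year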

Iterating view and doc designs multiple times

Just a quick post so that I remember to elaborate on this later.  I have found that whenever I have a large project to do in CouchDB I go through several iterations of designing the documents and the views.

My latest project is typical.

  1. First design was to push in really big documents.  The idea was to run map reduce, copy the reduce output to a second db, and map reduce that for the final result.  But the view generation was too slow, I never got around to designing the second db, and the biggest documents triggered a bug/memory issue.