Using CouchDB to store state: My hack to manage multi-machine data processing

This article describes how I use CouchDB to manage multiple computing jobs. I make no claims that this is the best way to do things. Rather I want to show how using CouchDB in a small way gradually led to a solution that I could not have come up with using a traditional relational database.

The fundamental problem is that I don’t know what I am doing when it comes to managing a cluster of available computers. As a researcher I often run into big problems that require lots of data crunching. I have access to about six computers at any given time: two older, low-powered servers, two better servers, and two workstations, one at work and one at home. If one computer can’t handle a task, it usually means I have to spread the pain around on as many idle CPUs as I can. Of course I’ve heard of cloud computing solutions from Amazon, Joyent, and others, but quite frankly I’ve never had the time or the budget to try out these services for myself.

At the same time, although I can install and manage Gentoo on my machines, I’m not really a sysadmin, and I can’t wire up a proper distributed, heterogeneous computing environment using cool technologies like ØMQ. What I’ve always done is human-in-the-loop parallel processing. My problems have some natural parallelism—for example, the data might be split across the 58 counties of California. This means that I can manually run one job per county on each available CPU core.

This human-in-the-loop distributed computing model has its limits, however. Sometimes it is difficult to get every available machine to have the same computational environment. Other times it just gets to be a pain to have to manually check on all the jobs and keep track of which are done and which still need doing. And when a job crashes halfway through, my manual method sucks pretty hard, as it usually means restarting that job from the beginning.
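To give a flavor of where this ends up (this is my own sketch of the idea, not necessarily the exact documents described later in the post): each job gets a small CouchDB document recording its state, and a view lists whatever still needs doing.

// One document per job (here, one per county); the field names are illustrative.
var jobDoc = {
    _id: 'job:2008:alameda',
    county: 'Alameda',
    state: 'pending',                // pending | running | done | failed
    host: null,                      // which machine has claimed the job
    updated: '2011-03-30T12:00:00Z'
};

// A CouchDB map function (emit is supplied by the view server)
// listing everything that is not yet done:
var pendingMap = function (doc) {
    if (doc.state && doc.state !== 'done') {
        emit(doc.state, doc._id);
    }
};

Any machine with a spare core can query that view, claim a job by flipping its state, and mark it done (or failed) when it finishes, which is exactly the bookkeeping I had been doing by hand.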

Continue reading

CouchDB and Erlang

Typical left-field introduction

As far as I understand it, the ability to run Erlang views natively is likely to be removed in the future: Erlang views are not sandboxed, so a view can execute arbitrary commands on the server.

Problem: big docs crash JSON.parse()

That said, I have a use case for Erlang views. Continue reading
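For context, an ordinary JavaScript view lives in a design document like the one below (a generic example, not the view from this post). Every document has to be serialized to JSON and handed to the external couchjs view server, which is exactly where a huge document can choke JSON.parse(); an Erlang view runs inside the CouchDB Erlang VM itself and skips that round trip, which is both why it helps with big documents and why it isn’t sandboxed.

{
  "_id": "_design/example",
  "language": "javascript",
  "views": {
    "by_type": {
      "map": "function (doc) { emit(doc.type, null); }"
    }
  }
}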

Using superagent to authenticate a user-agent in node.js, plus a bonus bug!

Summary

This post describes how I use the superagent library to test access to restricted resources on my web server. This is something that I found to take a bit more effort than I expected, so I thought I’d write this up for the greater good.

Context

I am running a website on which some resources are open to the internet, while others require authentication against our CAS server.

I have been logging into the CAS server using request. But in the interest of trying out different libraries and all that, I decided to rewrite my method using superagent.

I need to log into the CAS server from node.js because I am writing tests that verify that the protected resources are hidden to non-authenticated users, and available to authenticated ones. Continue reading
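The piece that matters is superagent’s agent(), which holds cookies across requests the way a browser would. Roughly like this (a sketch assuming a reasonably recent superagent; the URLs, credentials, and form fields are placeholders for a typical CAS login, not my actual code):

var request = require('superagent');

var agent = request.agent(); // keeps cookies between requests, like a browser

// 1. Hit the protected page; an anonymous user gets bounced to the CAS login form.
agent.get('https://example.com/private').end(function (err, res) {

    // 2. Post credentials to the CAS login form.
    //    (A real CAS form also wants the lt/execution hidden fields scraped from the page.)
    agent
        .post('https://cas.example.com/cas/login')
        .type('form')
        .send({ username: 'test', password: 'secret' })
        .end(function (err, res) {

            // 3. The same agent now carries the session cookie,
            //    so the protected page should come back as a 200.
            agent.get('https://example.com/private').end(function (err, res) {
                console.log(res.status);
            });
        });
});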

CAS validate

My first program pushed up to npm turned out to be a JavaScript CAS (www.jasig.org/cas) library I wrote for our portal at http://www.ctmlabs.net. The main thing holding me up from pushing anything to npm was the lack of tests. While I never run tests on packages downloaded from npm (one area where CPAN is definitely better than npm…all tests are run by CPAN on install), I felt that I couldn’t claim that a package was “eligible” for adding to npm until I could prove to myself that it worked like I thought it did.

The tests turned out to be a lot harder to write than I expected. I used Mocha, an excellent test framework, and should, a handy assertion library. But the hard part was getting a session with the CAS server to work. Continue reading
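The tests themselves ended up looking roughly like this (a stripped-down sketch; the URL is a placeholder, and the defensive err/res handling is there because different superagent versions report unfollowed redirects differently):

var should = require('should');
var request = require('superagent');

var protectedUrl = 'https://example.com/private'; // placeholder

describe('protected resources', function () {
    it('should redirect anonymous users toward CAS', function (done) {
        request
            .get(protectedUrl)
            .redirects(0) // do not follow the redirect, just inspect it
            .end(function (err, res) {
                var response = res || (err && err.response);
                response.status.should.be.within(300, 399);
                response.headers.should.have.property('location');
                done();
            });
    });
});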

More progress figuring out asynchronous programming in node.js

I had an old recursive directory creation program I wrote a while back for a node.js server I’m running, but it never seemed to work right.

Last week I went looking for something on github, and found a gist that seemed to be what I wanted, but it wasn’t. It didn’t understand absolute paths: it split on the ‘/’, which caused it to try to create the directory ‘’ (the empty string), and that failed.

So I retooled my program, and did the work of figuring out why it worked when invoked with one directory, but failed when invoked on a list of directories.

It all gets back to the fact that JavaScript passes references around, so you have to be careful to protect variables that your asynchronous callbacks depend on.
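Here is the shape of the bug, boiled down (an illustration, not the code from this post, and the two loops below are alternatives, not meant to run back to back): with var, the loop variable is shared by every callback, so by the time the asynchronous work finishes, everyone sees the last value.

var fs = require('fs');

var dirs = ['/tmp/a', '/tmp/b', '/tmp/c'];

// Broken logging: the directories are created with the right names
// (dir is read synchronously when fs.mkdir is called), but every
// callback closes over the same dir variable, which has already
// moved on to the last entry by the time the callbacks run.
for (var i = 0; i < dirs.length; i++) {
    var dir = dirs[i];
    fs.mkdir(dir, function (err) {
        console.log('made', dir); // always logs '/tmp/c'
    });
}

// Protected: forEach gives each iteration its own dir binding.
dirs.forEach(function (dir) {
    fs.mkdir(dir, function (err) {
        console.log('made', dir); // logs the directory that was actually made
    });
});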

First, here is my final version
Continue reading

Increment by day in JavaScript

I had a need to iterate over days of a year today in JavaScript.

I figured you could just add “one day” to the date, and that turns out to be true.

This is the test program I wrote to make sure the incrementing worked properly.

var year = 2008;
var endymd = new Date(year+1, 0, 1, 0, 0, 0); // first instant of the following year

var days = 0;
for (var ymd = new Date(year, 0, 1, 0, 0, 0); // January 1 of the target year
     ymd < endymd;
     ymd.setDate(ymd.getDate() + 1)) {
    days++;
}
console.log(days);

The answer is 366, which means that leap year was handled correctly.

Updated in 2015 (yes I use my old posts as notes to myself). Edited code to fix a bug, and also to use ymd<endymd as the stopping condition. I used to do ymd.getDate()<endymd, and defined endymd as getDate() from the get go. Probably the same speed, but a little cleaner to look at I think.

More Polymaps GeoJSON layer hacking

A while ago I started using Polymaps as my mapping library. I like it much better than other alternatives. I put together a rather stable map viewer that layers my own GeoJSON layer on top of OpenStreetMap tiles. My GeoJSON layer is rendered on demand and then cached to the file system in a nifty solution using Connect: Connect’s static file middleware, Connect’s router middleware, and a database-hitting service of my own. It all runs on node and is super fine and dandy. Some other day I’ll need to write that up.
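The middleware stack is roughly this (a sketch using the Connect 2.x-era API, which may not match the version I was actually running; renderTile stands in for the database-hitting service and the cache path is made up):

var connect = require('connect');
var http = require('http');
var fs = require('fs');
var path = require('path');

var cacheDir = '/var/cache/geojson'; // hypothetical cache location

var app = connect();

// 1. If the tile has already been rendered, serve it straight from the file cache.
app.use(connect.static(cacheDir));

// 2. Otherwise hit the database, render the GeoJSON, cache it, and serve it.
app.use(function (req, res, next) {
    var file = path.join(cacheDir, req.url);
    renderTile(req.url, function (err, geojson) { // placeholder for the real service
        if (err) return next(err);
        fs.writeFile(file, geojson, function () { // assumes the cache directory exists
            res.setHeader('Content-Type', 'application/json');
            res.end(geojson);
        });
    });
});

http.createServer(app).listen(3000);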
Continue reading

Preventing server timeout in node.js

Update 2017-05-11. Updated node source code line link again, and links to documentation.

Update 2016-03-07. New link to node source code line.

Update 2015-10-09.  Not sure why people keep hitting this, but they do.  Apparently node.js docs don’t do a good job explaining that there is a two minute time limit hard coded in there?  I updated the link below to the latest master branch.

Update 2013-10-08. This is an old post but continues to get page views, so clearly it is still a problem. The feature is now documented (see link below) and this post is still correct.

Original post, 2011-03-30

This is something I spent an hour or so trying to track down today, so I thought I’d write it up in the hopes that someone else is spared the trouble.

First of all, I have both web client and server written in node.js. My server is designed such that it first checks for cached versions of URLs, and if the file doesn’t exist, then it hits the database and creates the file. This second step can take a long time, and so I wanted to write a utility script that I could trigger manually to update the cache of files whenever the database changes.

So I wrote the script using javascript and node, but was getting a strange error in which the client would die if the server took longer than two minutes to complete the request. No amount of abusing the code on the client would change this, even though the node.js source code seemed to indicate that no timeout was ever being set on the client socket, and most questions on the internet were about how to limit the timeout, not set it to forever.

Turns out the suspicious setTimeout( 2 * 60 * 1000 ) in http.js at line 986 was indeed the culprit. I originally ignored that line, as it was only setting the timeout for the server-side socket. But then, after editing that line in the code and recompiling (grasping at straws), re-running the client using the recompiled node and still getting exactly 2 minutes for the socket to die, it suddenly hit me that my server was timing out, not my client!

So with a single undocumented call inside of the handler in question, I had no more troubles:

res.writeHead(200, { 'Content-Type': 'application/json' });
res.connection.setTimeout(0); // this could take a while

Note the second line above. While the 0.4.4 API docs don’t state this fact, the http response object exposes the socket as the connection object (I found this on the mailing list in this thread http://groups.google.com/group/nodejs/browse_thread/thread/376d600fb87f498a). So res.connection gives a hook to the Socket’s setTimeout function, and setting that to zero drops the default 2 minute limit.
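Put together, the handler looks something like this (a stripped-down sketch; buildFile stands in for the slow database-plus-cache step):

var http = require('http');

http.createServer(function (req, res) {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.connection.setTimeout(0); // this could take a while; drop the 2 minute default
    buildFile(req.url, function (err, json) { // placeholder for the real work
        res.end(json);
    });
}).listen(3000);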

November 2012 update. I’m still doing this in node.js 0.6.8 and 0.8.x; setTimeout is still part of net, and the http server still defaults to a 2 minute timeout. And github is still awesome.

August 2014 update. Yes, still there: 2 minute timeout. Really this isn’t a bug that needs fixing, because who wants a server to go away for two minutes in these days of attention deficit disorder web surfing. But I wish it were documented. Apparently this behavior will change soon: https://github.com/joyent/node/issues/4704

And with the release of 0.10.x, it is now documented. See server set timeout and response set timeout.

When I modify my own code to use 0.10.x, I will put up a new post.
No, actually, I apparently never got around to putting up a new post on using the now-documented timeout functions.
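For the record, the now-documented calls look roughly like this (an untested sketch; either one kills the default two minute idle timeout):

var http = require('http');

var server = http.createServer(function (req, res) {
    res.setTimeout(0); // per-response: no idle timeout on this socket
    // ... slow work goes here ...
    res.end('done');
});

server.setTimeout(0); // or server-wide: disable the idle timeout for every connection
server.listen(3000);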

October 2015 / March 2016 / May 2017 updates. Still there, but now you can find the offending hard-coded two minutes in the node source code here. The documentation links above now link to the version 7 API, but are still reasonably accurate for older (v6, etc.) versions of node.

Development server logs during development

In a prior post trumpeting my modest success with getting geojson tiles to work, I typed in my server address, but didn’t make it a link. That way robots wouldn’t automatically follow the link and my development server wouldn’t get indexed by Google indirectly.

What is interesting to me is that I still get the occasional hit from that posting. And this is with the server bouncing up and down almost continuously as I add functionality. Just now I was refactoring the tile caching service I wrote, and in between server restarts, someone hit my demo app.

And the GeoJSON tiler is coming along. In making the caching part more robust, I added a recursive directory creation hack which I explain below.

Continue reading