CouchDB and Erlang

Typical left-field introduction

As far as I understand it, the ability to run Erlang views natively is likely to be removed in the future because it does not offer any sandboxing of the content, so a view can execute arbitrary commands on the server.

Problem: big docs crash JSON.parse()

That said, I have a use case for Erlang views. I have a database with two really large documents. They are TopoJSON documents of a 4km grid overlaid on California. One grid is in the original projection, and the other is reprojected as best as I can manage to SRID:4326. Both files are about 3.4MB, and apparently they cause intermittent problems for the JS view engine in CouchDB. I can’t quite nail down the issue enough to file a bug report, but if the docs are in there, any view can fail. If I delete the TopoJSON docs and then put them back again, views will sometimes pass. My suspicion is that the failure happens when a very large block of docs that includes the TopoJSON files gets sent to the view engine, and the RAM required to parse the JSON is more than the server has available. My guess is that the views work after the TopoJSON files are deleted and re-inserted because the files are then processed by themselves, or with a much smaller batch of other docs.

Regardless, the issue definitely comes about due to marshalling the data from Erlang to JS and then breaking it up using JSON.parse (that is what the crash dump says). So, I put together a super simple Erlang view, and it passed just fine, while the equivalent JS view failed. Problem solved (at least until the CouchDB devs pull the trigger on stripping out Erlang views).

But now I have new problems…

While I’m good with JavaScript, I’m about as clueless with Erlang as I am helping my daughters buy ballet leotards. But I found out a couple of things that I want to write down here.

First, due to a few helpful resources on the internet, I was able to get a simple view going. A typical record looks like this:

{
   "_id": "129_160_2007-01-01 00:00",
   "_rev": "2-4f3364efb916a4979dc4efe13bc667d1",
   "geom_id": "129_160",
   "i_cell": 129,
   "j_cell": 160,
   "data": [
       "2007-01-01 00:00",
       "280",
       23153.13,
       184.23,
       458.45,
       0.03,
       73.48,
       46.72,
       5.04,
       61.33,
       13.28,
       2.03,
       52.24,
       13.08,
       52.31,
       2,
       "400338",
       "401512"
   ],
   "aadt_frac": {
       "n": 0.02204350388061379,
       "hh": 0.029391644125345406,
       "not_hh": 0.024323281333117733
   }
}

I wanted a view that would highlight the documents that had abnormally high values for aadt_frac.n. In JavaScript, that is easy, but as I said, JavaScript failed me. In Erlang, how would one access the internal members of the JSON object? The answer is below.

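fun({Doc}) ->
  %% Doc is the document's list of {Key, Value} pairs;
  %% get_value returns undefined when the key is missing
  A = couch_util:get_value(<<"aadt_frac">>, Doc),
  case A of
    undefined ->
      ok;
    _ ->
      G = couch_util:get_value(<<"geom_id">>, Doc),
      %% Unpack the nested object; this match assumes exactly
      %% these three keys, in this order
      {[{<<"n">>, N},{<<"hh">>, Hh},{<<"not_hh">>, Nhh}]} = A,
      if
        N > 0.5 -> Emit(G, N);
        true -> ok
      end
  end
end.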

CouchDB exposes some Erlang utilities in the couch_util module, and one of those will get a value out of a JSON record. That is what the first line is doing: A = couch_util:get_value(<<"aadt_frac">>, Doc) extracts the ‘aadt_frac’ value from the document and stores it in the variable A. Erlang variables have to start with a capital letter (or an underscore), which was weird to me; names that start with a lowercase letter are atoms.

I learned about case statements from the Erlang User’s Guide, with a little more help from the two examples in the CouchDB wiki page. If A is undefined, then skip the doc. If it is defined, then I have to extract its contents. The line {[{<<"n">>, N},{<<"hh">>, Hh},{<<"not_hh">>, Nhh}]} = A does this, following the advice from http://stackoverflow.com/a/2422631. The last if statement simply checks if N is greater than 0.5, and if it is, it will emit the grid’s geometry id and the value of N.
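
For comparison, here is roughly what the same logic looks like as a JavaScript map function. This is a sketch and not the view I actually ran (the JS engine was the thing that was crashing). Inside CouchDB, emit is a global; the makeMap wrapper that takes emit as a parameter is a hypothetical helper, there only so the function can be run standalone under node.

```javascript
// Sketch of a JavaScript counterpart to the Erlang view above.
// makeMap is a hypothetical wrapper: CouchDB normally supplies `emit`
// as a global, but passing it in lets this run outside CouchDB.
function makeMap(emit) {
  return function (doc) {
    // Skip geometry files, design docs, and anything else without aadt_frac
    if (doc.aadt_frac === undefined) {
      return;
    }
    // Flag documents with an abnormally high n fraction
    if (doc.aadt_frac.n > 0.5) {
      emit(doc.geom_id, doc.aadt_frac.n);
    }
  };
}

// Usage: collect whatever gets emitted into an array
const emittedRows = [];
const mapFn = makeMap((key, value) => emittedRows.push({ key, value }));
mapFn({ geom_id: "129_160", aadt_frac: { n: 0.022, hh: 0.029, not_hh: 0.024 } });
mapFn({ geom_id: "197_102", aadt_frac: { n: 0.9, hh: 0.05, not_hh: 0.05 } });
mapFn({ _id: "some_topojson_doc" });
console.log(emittedRows);
```

Unlike the Erlang pattern match, the property lookup here does not care about the order of the keys inside aadt_frac.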

This view worked great, and turned up 20 or so locations and times with outrageously high values for n, hh, and not_hh. I dealt with these issues in my code, and created new versions of the documents, and then dove into some modeling in R.

My goal is to build a spatial model of these variables for places where I have no data measurements and save the estimates into the same CouchDB, filling in all the empty grid cells. Unfortunately, my first attempt failed—my code had a bug and I wrote blank entries for the aadt_frac record. Luckily I only messed up on a single month for a very small section. Unfortunately, that was still 200,000+ documents I now needed to delete.

The first step was to find these documents using a view. JavaScript was still out, but because I had learned me some Erlang, I could write the following simple view:

fun({Doc}) ->
        T = couch_util:get_value(<<"ts">>, Doc),
        case T of
            undefined ->  ok;
            _ ->
                D = couch_util:get_value(<<"data">>, Doc),
                case D of
                    undefined ->
                        Emit(1,1);
                    _ ->
                        ok
                end
        end
end.

Simple logic: if the document has a timestamp field, it is a document I want to consider (not a geometry file and not a design document), and if that file with a timestamp is missing a data field, then it needs to be deleted, so flag it by emitting something.

Now I bumped up against another problem with CouchDB…how to bulk delete documents. In SQL, I would write a query, verify that it was returning what I expected, double and triple check, and then drop those records from the table in one big statement. In contrast, CouchDB speaks HTTP, and you must resort to DELETE requests, or POST to _bulk_docs with a list of document ids and revisions, each marked with "_deleted": true. The view above is the first step. It allows me to gather up all of the document ids and revision numbers that I need to delete. The next step is to write code to bundle up those ids and revs, slap on the delete directive, and send the request to the CouchDB _bulk_docs interface.
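
To make the bundling step concrete, here is a minimal sketch of building the _bulk_docs request body from view rows. It assumes the view was queried with include_docs=true, so that each row carries the document's _rev. The helper name buildDeletePayload is made up for illustration, and the sketch leaves out the HTTP request itself.

```javascript
// Build the JSON body for POST /{db}/_bulk_docs from view rows.
// Assumption: the view was queried with include_docs=true, so every
// row has row.doc._rev. buildDeletePayload is a hypothetical name.
function buildDeletePayload(viewRows) {
  return {
    docs: viewRows.map((row) => ({
      _id: row.id,
      _rev: row.doc._rev,
      _deleted: true, // tells CouchDB to delete this revision
    })),
  };
}

// Usage with two made-up rows shaped like a view response
const exampleRows = [
  { id: "doc-a", key: 1, value: 1, doc: { _id: "doc-a", _rev: "1-abc" } },
  { id: "doc-b", key: 1, value: 1, doc: { _id: "doc-b", _rev: "3-def" } },
];
console.log(JSON.stringify(buildDeletePayload(exampleRows), null, 2));
```

With 200,000+ documents, it is probably wise to send the rows in batches rather than in one enormous request.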

Ordinarily I just hack a little something out in whatever language I am using to get that job done. But I was working in R, and I hate using R for this sort of thing, so I turned to node.js, my hammer of choice these days. Because I’ve done this before, I decided to just write a self-contained package that will allow the deletion of every document that is emitted by a view. And I also wrote tests, to prove that it worked before I unleashed it on my 6GB database. The code is on GitHub. I ran it and deleted the problem docs. Problem solved.

But now I have new problems…

In the never-ending saga of many mistakes that is my life, I fixed the aadt_frac generation code, but failed to notice and fix the fact that the _id values being generated by my R code were broken, or rather, nonexistent. A big part of my workflow is knowing that I can query documents directly by their ids, and I had just stored a bunch of documents with randomly generated UUIDs for ids. So I turned to Erlang and was faced with a new problem: I couldn’t simply rely on testing for undefined, because all documents have a valid id! Instead I needed to compare that id value against something. A new skill!

My first attempts, after a bit of searching around, failed miserably. A valid id is supposed to be the concatenation of the geom_id and the timestamp, with an underscore separating the two. What I needed was a way to test if the id looked like that. First I tried a split type approach, using the string:tokens function:

[Icell,Jcell,Ts] = string:tokens(Id, "_")

but that failed. Actually it failed due to a syntax error higher up, but I abandoned the approach anyway…I was trying to split when I really just wanted to compare a substring, so I went with a different function.

Id = couch_util:get_value(<<"_id">>, Doc),
Gid = couch_util:get_value(<<"geom_id">>, Doc),
case string:str(Id, Gid) of
    0 -> Emit(2,1);
    _ -> Emit(3,1)
end;

This one didn’t have a syntax error, but kept failing no matter how I prodded and poked. The logs said:

** Reason for termination ==
** {function_clause,
       [{string,str,
            [<<"b71476cb16533f8570ca909cad3a02dc">>,<<"197_102">>],
            [{file,"string.erl"},{line,102}]},
        {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,572}]},
        {erl_eval,expr,5,[{file,"erl_eval.erl"},{line,250}]},
        {couch_native_process,'-run/2-fun-0-',2,
...

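“Hmm” I said to myself, “maybe these aren’t strings.” I mean, the data underneath them is a string, but that doesn’t mean Erlang thinks they are strings. The <<"...">> values in the log are binaries, which is how CouchDB hands JSON strings to Erlang, while string:str only accepts lists of characters, hence the function_clause error. So I revisited the CouchDB wiki page, and saw this snippet in the second filter: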

Values = binary:split(ValuesParam, <<",">>, [global]),

Armed with that, I went poking around the Erlang docs and found binary:match, and put together this bit that worked:

Id = couch_util:get_value(<<"_id">>, Doc),
Gid = couch_util:get_value(<<"geom_id">>, Doc),
case binary:match(Id, Gid) of
    nomatch -> Emit(2,1);
    _ -> ok
end;
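
For anyone more at home in JavaScript, the containment test above can be sketched like this. It is only an analogue for checking documents outside CouchDB, not something I ran as a view, and hasGeomIdInId is a made-up helper name.

```javascript
// Rough JavaScript analogue of the binary:match check above.
// hasGeomIdInId is a hypothetical helper, not part of CouchDB.
function hasGeomIdInId(doc) {
  // indexOf returns -1 when the substring is absent, much as
  // binary:match returns nomatch
  return doc._id.indexOf(doc.geom_id) !== -1;
}

const goodDoc = { _id: "129_160_2007-01-01 00:00", geom_id: "129_160" };
const badDoc = { _id: "b71476cb16533f8570ca909cad3a02dc", geom_id: "197_102" };
console.log(hasGeomIdInId(goodDoc)); // true
console.log(hasGeomIdInId(badDoc)); // false
```

Like binary:match, this only checks that the geom_id appears somewhere in the id. A stricter test would require the id to start with the geom_id followed by an underscore.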

Once again, I had a view with lots of documents in it, so I passed it to my handy dandy node_couch_view_deleter program, and got rid of them all.

Conclusion

I’ve been telling my daughter to always write conclusions that conclude her papers. Of course I also tell her to write paragraphs with at least 5 sentences in them. Apparently I don’t follow my own advice.
