More with the GDAL/OGR perl bindings

So my last post talked about my struggles to finally get something saved in the database using the native perl bindings into the GDAL/OGR library. Once I got that working and pushed out the post, I immediately started loading up multiple files and playing around with the data. One thing I noticed was that it was impossible to separate different “trips” within the data without playing around with space and time. What I wanted was an easy way to flag each batch of points with a field identifying the run.

The auto-generated schema for the GPX data looks like this:

\d testogr.track_points
                                              Table "testogr.track_points"
       Column       |           Type           |                               Modifiers                                
--------------------+--------------------------+------------------------------------------------------------------------
 ogc_fid            | integer                  | not null default nextval('testogr.track_points_ogc_fid_seq'::regclass)
 wkb_geometry       | geometry(Point,4326)     | 
 track_fid          | integer                  | 
 track_seg_id       | integer                  | 
 track_seg_point_id | integer                  | 
 ele                | double precision         | 
 time               | timestamp with time zone | 
 magvar             | double precision         | 
 geoidheight        | double precision         | 
 name               | character varying        | 
 cmt                | character varying        | 
 desc               | character varying        | 
 src                | character varying        | 
 link1_href         | character varying        | 
 link1_text         | character varying        | 
 link1_type         | character varying        | 
 link2_href         | character varying        | 
 link2_text         | character varying        | 
 link2_type         | character varying        | 
 sym                | character varying        | 
 type               | character varying        | 
 fix                | character varying        | 
 sat                | integer                  | 
 hdop               | double precision         | 
 vdop               | double precision         | 
 pdop               | double precision         | 
 ageofdgpsdata      | double precision         | 
 dgpsid             | integer                  | 
 speed              | double precision         | 
Indexes:
    "track_points_pkey" PRIMARY KEY, btree (ogc_fid)
    "track_points_wkb_geometry_geom_idx" gist (wkb_geometry)

There are three fields that are completely blank: src, desc, and name. I decided to use src to identify the source of the data as the file name it came from.

First I modified my previous program to parse the command line options using Getopt::Long. I don’t use all of its power in this example, but in the past I’ve been well served by starting with that in case the script grows and mutates.

With Getopt::Long, I understand there are ways to input a list of things into the arguments. You can have multiple invocations of the same option, for example, --file mydata.gpx --file moredata.gpx, or you can input them as a comma separated list and follow the recipe in the perldoc for the module. However, I wanted to use a glob, like --file data/*.gpx, so I instead decided to just stick all the files after a double dash on the command line. So really, in the following code, I’m only using Getopt::Long to parse out a --help command! However, it’s there if I need to expand functionality in the future.

use strict;
use warnings;
use Carp;

use Geo::GDAL;
use Data::Dumper;

use Getopt::Long;
use Pod::Usage;

my $man = 0;
my $help = 0;

my @files;

my $result = GetOptions(
    'help|?' => \$help,
    ) or pod2usage(2);

pod2usage(-exitval => 0, -verbose => 2) if $help;

@files = @ARGV;
...

With that, I have all of my input files in an array, and I can loop over them and store the filename in the source field in the db by using $new_feature->SetField('src',$_);, as follows:

foreach (@files){

    my $ds = Geo::OGR::Open($_);

    my $layer         = $ds->Layer($layer_name);
    my $feature_count = $layer->GetFeatureCount();
    carp "$layer_name, $feature_count";
    if ( $feature_count < 10 ) {
        next;
    }

    carp "saving $_ to pg";

    # now append each feature
    my $x = 0;
    $pg_layer->StartTransaction();
    while ( my $feature = $layer->GetNextFeature() ) {

        my $new_feature = Geo::OGR::Feature->new($defn);
        $new_feature->SetFrom($feature);

        # write the filename as the src field, for making lines later
        $new_feature->SetField('src',$_);

        my $pgf = $pg_layer->CreateFeature($new_feature);

        $x += 1;
        if ( $x % 128 == 0 ) {
            carp $x;
            # uncomment the following to crash your program
            # $pg_layer->CommitTransaction();
            # StartTransaction() seems to auto commit prior transaction?
            $pg_layer->StartTransaction(); 
            $x = 0;
        }

    }
    if ($x) {
        carp "all done, $x remaining";
        $pg_layer->CommitTransaction(); # this one doesn't crash for some reason
        carp "last transaction committed";
    }
}

That does its magic, and the database now has distinct groups of points. Now if you want to make “lines” out of those points, you can do this in PostGIS:

SELECT ST_MakeLine(wkb_geometry ORDER BY track_seg_point_id ASC) AS linegeom, src
INTO table testogr.lines
FROM testogr.track_points
GROUP BY src;

Et voila

QGIS rendering the new lines table, on top of OSM lines data

Of course, that isn’t at all helpful, as I want to see speeds, not just the lines. Next step is to try to figure out how to add a measure to each point, and then collect those (X,Y,M) type points into a line with a measure dimension. I guess that will be my next post.

Using GDAL/OGR perl bindings to load GPX files into PostgreSQL/PostGIS

Today I wrote a short perl program to import GPX files into PostgreSQL using the OGR library’s native perl bindings. This was a super pain to figure out because the naive way doesn’t work, and it appears all the documentation pushed out to mailing lists and on various wikis talks about Python.

OGR has an excellent tool called ogr2ogr that allows you to append data. However, I didn’t want to use that because I wanted to fiddle with the data first, then pipe it to SQL. Specifically, I wanted to delete long pauses at stop lights, etc., and I wanted to use some logic to make sure I didn’t blindly reload old GPX files.

My initial solution was to simply copy the GPX layer in, and then hunt around for a way to flip on an “append” option. My initial program looked like:

use strict;
use warnings;
use Carp;

use Geo::GDAL;
use Data::Dumper;

# Establish a connection to a PostGIS database
my $pg = Geo::OGR::GetDriverByName('PostgreSQL');
if ( !$pg ) {
    croak 'PostgreSQL driver not available';
}

my $conn = $pg->Open( "PG:dbname='osm' user='james' schemas=testogr", 1 );

if ( !$conn ) {
    croak 'choked making connection';
}

my $ds = Geo::OGR::Open('../test/2014-07-10_07-29-12.gpx');

my $pg_layer;
my $defn;

## I'm only interested in the track_points layer
my $layer_name = 'track_points';
my $layer      = $ds->Layer($layer_name);

# use copy
$pg_layer = $conn->CopyLayer( $layer, $layer_name, { 'overwrite' => 1 } );
if ( !$pg_layer ) {
    carp 'failed to copy';
}

1;

That works, but curiously the automatic FID doesn’t automatically increment when using CopyLayer. No matter, I don’t actually use that, because I like creating my own table definitions.

And even if that did work properly, it would only work once. Every other time, that “overwrite” option on the CopyLayer command is going to wipe the table.

Poring over the docs, I didn’t see any option for “append” as was used in the ogr2ogr utility. So I combed through the ogr2ogr source code, and discovered that the “-append” option actually causes the code to create each feature and add it to the existing layer inside of a loop, by iterating over each of the fields in the layer:

    if (papszFieldMap && bAppend)
    {
        int bIdentity = FALSE;

        if (EQUAL(papszFieldMap[0], "identity"))
            bIdentity = TRUE;
        else if (CSLCount(papszFieldMap) != nSrcFieldCount)
        {
            fprintf( stderr, "Field map should contain the value 'identity' or "
                    "the same number of integer values as the source field count.\n");
            VSIFree(panMap);
            return NULL;
        }

        for( iField=0; iField < nSrcFieldCount; iField++)
        {
            panMap[iField] = bIdentity? iField : atoi(papszFieldMap[iField]);
            if (panMap[iField] >= poDstFDefn->GetFieldCount())
            {
                fprintf( stderr, "Invalid destination field index %d.\n", panMap[iField]);
                VSIFree(panMap);
                return NULL;
            }
        }
    }

So I tried something like that, but for some reason I kept failing to be able to add the new feature to the existing PostgreSQL layer. My broken code looked like:

if ( !$append ) {
    $pg_layer = $conn->CopyLayer( $layer, $layer_name );
    if ( !$pg_layer ) {
        carp 'failed to copy';
    }
}
else {
    if ( !$pg_layer ) {

        # try to get the layer from db
        $pg_layer = $conn->GetLayerByName($layer_name);
        $defn     = $pg_layer->GetLayerDefn();
    }

    # now append each feature
    while ( my $feature = $layer->GetNextFeature() ) {

        my $newFeature = Geo::OGR::Feature->new($defn);

        # Add field values from input Layer
        for my $fi ( 0 .. $defn->GetFieldCount() - 1 ) {
            $newFeature->SetField( $defn->GetFieldDefn($fi)->GetNameRef(),
                $feature->GetField($fi) );

            # Set geometry
            $newFeature->SetGeometry( $feature->GetGeometryRef() );
        }

        # THIS BREAKS 
        my $pgf = $pg_layer->InsertFeature($newFeature);

    }
}

And many variations on that theme, including just trying to directly copy in the feature with $pg_layer->InsertFeature($feature).

The unhelpful error read:

RuntimeError Illegal field type value at /usr/local/lib64/perl5/Geo/OGR.pm line 1473.

I hacked out a little instrumentation around Geo/OGR.pm line 1473, but then I found out that the problem “field type value” changed every time, which made me think I was doing something wrong.

Finally, after giving up twice, I stumbled on an old mailing list posting here. Again, it was in Python, but I read Python well enough to translate into perl without problems. With a little bit of hacking around a buggy call to CommitTransaction(), it worked! My final code looks like:

use strict;
use warnings;
use Carp;

use Geo::GDAL;
use Data::Dumper;

# Establish a connection to a PostGIS database
my $pg = Geo::OGR::GetDriverByName('PostgreSQL');
if ( !$pg ) {
    croak 'PostgreSQL driver not available';
}

my $conn = $pg->Open( "PG:dbname='osm' user='james' schemas=testogr", 1 );

if ( !$conn ) {
    croak 'choked making connection';
}

my $ds = Geo::OGR::Open('../test/2014-07-14_17-56-45.gpx');

my $pg_layer;
my $defn;
my $layer_name = 'track_points';

my $layer         = $ds->Layer($layer_name);
my $feature_count = $layer->GetFeatureCount();
carp "$layer_name, $feature_count";
if ( $feature_count < 10 ) {
    croak;
}
carp "saving to pg";
if ( !$pg_layer ) {

    # try to get the layer from db
    $pg_layer = $conn->GetLayerByName( $layer_name, 1 );
    $defn = $pg_layer->GetLayerDefn();
    carp $pg_layer->GetFeatureCount();
}

# now append each feature
my $x = 0;
$pg_layer->StartTransaction();
while ( my $feature = $layer->GetNextFeature() ) {

    my $new_feature = Geo::OGR::Feature->new($defn);
    $new_feature->SetFrom($feature);
    my $pgf = $pg_layer->CreateFeature($new_feature);

    $x += 1;
    if ( $x % 128 == 0 ) {
        carp $x;
        # leaving this uncommented causes a crash.  Bug?
        # $pg_layer->CommitTransaction();
        $pg_layer->StartTransaction();
        $x = 0;
    }

}
if ($x) {
    carp "all done, $x remaining";
    # curiously, this call to CommitTransaction works okay
    $pg_layer->CommitTransaction();
    carp "last transaction committed";
}
1;

At stage 3 with self-driving cars

I recently wrote that self-driving cars were inevitable and would change nearly everything about our understanding of traffic flow and how the demand for travel (a person wanting to be where he or she is not) will map onto actual trips. We’re planning using the old models, which are sucky and broken, but now they are even more sucktastic and brokeriffic.

Today in the LA Times business section [1] an article reports that a “watchdog” group [2] is petitioning the DMV to slow down the process of adopting self-driving cars. It struck me that this act is very similar to bargaining, which means we’re at the 3rd stage of grief.

The first stage is denial. “It can never happen.” “Computers will never be able to drive a car in a city street.” Over. Done. Proven wrong.

The second stage is anger. I haven’t seen that personally, but I have seen hyperbole in attacks like “what are you going to do when a robot chooses to kill innocent children on a bus”. A cross between stage one and stage two is probably this article from The Register.

The third stage is bargaining. The linked page above has the example of “just let me see my son graduate”. In this case, we’ve got “slow down to 18 months so we can review the data and make sure it is safe”. While I’m not suggesting we rush to adopt unsafe robot cars, it is interesting to see how quickly the arguments against self-driving cars have moved to stage 3.

I’m keeping an eye out for depression (old gear-heads blaring Springsteen’s Thunder Road while tinkering with their gas guzzling V-8s?) and then acceptance (we’ve got a robot car for quick trips around town, but we also have a driver car for going camping in the mountains).


  1. The link is the best I could find right now, but is exactly the same as the print article 
  2. The group non-ironically calls itself Consumer Watchdog! 

Why is there glitter on the floor?

Glitter

The light bouncing off the chair leg makes the ugly scratches in the floor sparkle like glitter.

I’ve spent many hours thinking about driverless cars, and have even drafted a few blog posts.  With the announcement the other day from Google, and the subsequent flurry of news coverage, it is time for me to join the party and get my thoughts out there.

A prediction

First, my prediction: Self-driving cars will become standard.

quick tests are great when documentation is thin

I have 14,000-odd items that I want to copy from PostgreSQL into CouchDB. While bulkdocs is great, 14,000 is too many for a single call. So I want to split the big array into a lot of smaller arrays.

I thought there was a simple function in [lodash](http://lodash.com) that I could use, and remembered having used [groupBy](http://lodash.com/docs#groupBy) in the past.

But the docs are slightly wrong. They imply that the callback function gets passed one argument, the array element, but the usual idiom for these sorts of functions is that they are passed two or three arguments: the array element, the index of the element, and the entire array.

Sure enough that is what is done:

var _ = require('lodash')
var groups = _.groupBy([4.2, 6.1, 6.4], function(num,idx,third) {
                 console.log(num,idx,third)
                 return idx % 2
             });

console.log(groups)

Running this (node test.js) produces

4.2 0 [ 4.2, 6.1, 6.4 ]
6.1 1 [ 4.2, 6.1, 6.4 ]
6.4 2 [ 4.2, 6.1, 6.4 ]
{ '0': [ 4.2, 6.4 ], '1': [ 6.1 ] }

So I can group my massive array into smaller arrays by munging the index.
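
To get consecutive batches rather than interleaved groups, the index can be divided instead of taken modulo. Here is a minimal sketch in plain JavaScript with no lodash dependency; `chunkByIndex` and the batch size are hypothetical names chosen for illustration. The same key, `Math.floor(idx / size)`, could presumably be returned from the groupBy callback shown above, but that relies on the callback receiving the index, which the docs do not promise.

```javascript
// Hypothetical helper: split `items` into consecutive batches of at most
// `size` elements, keyed by the integer part of index / size.
function chunkByIndex(items, size) {
  const groups = {};
  items.forEach(function (item, idx) {
    const key = Math.floor(idx / size);
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(item);
  });
  // Integer-like keys are enumerated in ascending order, so the batches
  // come back in the same order as the original array.
  return Object.keys(groups).map(function (key) {
    return groups[key];
  });
}

const batches = chunkByIndex([4.2, 6.1, 6.4], 2);
console.log(batches); // [ [ 4.2, 6.1 ], [ 6.4 ] ]
```

Each inner array can then be handed to a separate bulk docs call.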

Dante was like Tupac

This post is totally wrong, so there. Disclaimer ahoy.

So the lovely wife came home from some nutty adult education class with some interesting but completely irrelevant facts. One of them was that Dante apparently finished the Inferno just days before he died. I think not. I think more likely he died, and his krew was trying to get up the scratch for a new stable of horses so they put together some almost finished stuff and just *claimed* that Dante finished it. If Dante had died in 1996, for sure he would have been on a giant big screen at this year’s Coachella festival.

When in doubt, use async.queue()

As with many other satisfied users, my go-to library for handling asynchronous processing in node.js is the excellent async library. But what works in small doses doesn’t always work for larger problems.

Specifically, a common use pattern for me is to use it to handle checking things in CouchDB. Often I’m too lazy to code up a proper bulk docs call, so I’ll just run off a bunch of queries asynchronously. This evening I was testing some such code out and it was working for test cases with 10 and 100 items, but it fell over with “double callback” errors when I loaded up 9,000+ items to the list.

The problem of course is that async really means async. When you have an array with 9,000 items in it, and you use, say, filter on it like so:

var my_array=[...]
async.filter(my_array,
        function(item,cb){
                check_true_or_false_via_calling_couchdb(item,cb)
                return null
        },
        function(results_array){
                done(null,results_array)
                return null
        })

then what is happening is that filter is firing off as many hits as it can to CouchDB, which in this case is 9000+. This breaks things, with CouchDB shutting down, my SSH tunnels blocking things, etc etc.
The plumbing has gone “higgledly piggedly”, like that old Bloom County punchline.

So instead, use async’s queue:

var filtered_tasks = []
var q = async.queue(function(task,callback){
            filter_grids(task,function(doit){
                if(doit){
                    // keep these
                    filtered_tasks.push(task)
                }// drop those
                return callback()
            })
        },100)
// assign a callback for when the queue drains
q.drain = function() {
    //console.log('all items have been processed');
    cb_alltasks(null,filtered_tasks)
}
var tasks = _.map(grid_records
                 ,function(v,k){
                      var task = {'options':_.clone(config)}
                      task.cell_id = k
                      task.year = year
                      _.extend(task,v)
                      return task
                  })
q.push(tasks)

I chose the concurrency by playing with it: 10 is too slow (it took 25 seconds), while 100 and 1000 both take 9 seconds.
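
To make the idea behind the concurrency cap concrete, here is a minimal sketch of a bounded filter in plain JavaScript. It is not how the async library implements its queue; `filterLimited` and its arguments are hypothetical names. It assumes `check(item, cb)` calls `cb` exactly once with a true or false value, and it returns the kept items in completion order, not input order.

```javascript
// Hypothetical helper, not part of the async library: filter `items` with an
// asynchronous `check(item, cb)` while keeping at most `limit` checks in flight.
function filterLimited(items, limit, check, done) {
  const kept = [];
  let next = 0;     // index of the next item to start
  let active = 0;   // checks currently in flight
  let finished = 0; // checks that have called back

  if (items.length === 0) {
    return done(null, kept);
  }

  function launch() {
    // Start new checks until the cap is reached or the items run out.
    while (active < limit && next < items.length) {
      const item = items[next++];
      active++;
      check(item, function (keep) {
        active--;
        finished++;
        if (keep) {
          kept.push(item);
        }
        if (finished === items.length) {
          return done(null, kept);
        }
        // A slot has been freed, so try to start another check.
        launch();
      });
    }
  }

  launch();
}
```

With a limit of 100, a list of 9,000 items never has more than 100 requests outstanding, which is the same effect the second argument to async.queue has in the code above.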

From simple examples to complicated real world cases

I have a really irritating use-case for a CouchDB view. I have several hundred million documents representing hourly data for 4km grid cells in California, and I need to group them by areas. For example, grid cell i=100, j=223 is in Mendocino County, and in the “NORTH COAST” air basin. Of course I have the geometry of the grid cells and the geometry of the counties, air basins, and so on, in PostgreSQL/PostGIS, and I usually just shoot off a query to get the relationship and I’m done. This is CouchDB, however, and views cannot rely on external information lest they become idemimpotent (I made that up). Everything that the view needs must be in the view from the start.

Fair enough, I set up the SQL queries and generated my 9,800+ row JavaScript hash lookup table that maps grid cell to various areas of interest. Now I want to mix that into the view without pulling my hair out.

There is a really simple example in the CouchDB wiki. I’ve reproduced it below:

 {
   _id:"_design/test",
   language: "javascript",
   whatever : {
     stringzone : "exports.string = 'plankton';",
     commonjs : {
       whynot : "exports.test = require('../stringzone')",
       upper : "exports.testing = require('./whynot').test.string.toUpperCase()"
     }
   },
   shows: {
     simple: "function() {return 'ok'};",
     requirey : "function() { var lib = require('whatever/commonjs/upper'); return lib.testing; };"
   },
   views: {
     lib: { 
       foo: "exports.bar = 42;" 
     },
     test: { 
       map: "function(doc) { emit(doc._id, require('views/lib/foo').bar); }"
     }
   }
  }

So where the above example says foo: "exports.bar = 42;", I want to add in my massive hashtable. Obviously cutting and pasting so many lines is not the way to go. Instead, I’m using a couchapp tool.

The concept of a couchapp used to get more press than it currently seems to, but the basic idea is to use code to load up your design doc with attachments and views. In my case, I couldn’t care less about the attachments and the notion of a webapp stored and served by CouchDB. I just want to programmatically construct the view document, and push it to CouchDB. I chose to use node.couchapp.js. I could also have "rolled my own", and in fact I probably will this afternoon. I am playing around with grunt, so I used grunt_couchapp (after patching it a bit to use cookie based authentication).

The basic structure of my directory is the following


config.json
package.json
Gruntfile.js
app.js
lib
├── cellmembership.json
└── dump_membership.js
node_modules
├── ...
└── ...

The config.json file contains my database details, including my username and password. package.json contains the npm dependencies, mostly containing what was pulled in by the grunt_couchapp tool, and the node_modules directory holds all the node modules. I do not have an _attachments directory, so I make sure my design doc has no attachments!

Before getting to app.js, in which the design document is defined, I will first talk about what goes into it. The lookup table is stored as a JSON object in lib/cellmembership.json. The contents look like:

{ "100_223":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
 "100_224":{"airbasin":"NORTH COAST","bas":"NC","county":"MENDOCINO","fips":"23","airdistrict":"MENDOCINO COUNTY AQMD","dis":"MEN"},
   ... 9,890 more lines like this ...
 "304_48":{"airbasin":"SALTON SEA","bas":"SS","county":"IMPERIAL","fips":"13","airdistrict":"IMPERIAL COUNTY APCD","dis":"IMP"},
 "98_247":{"airbasin":"NORTH COAST","bas":"NC","county":"HUMBOLDT","fips":"12","airdistrict":"NORTH COAST UNIFIED AQMD","dis":"NCU"}
}

The view code that uses this file is saved to lib/dump_membership.js, and looks like:

module.exports = function(doc){
    var lookup = require('views/lib/cellmembership').lookup
    emit(lookup[doc.cell_id].county, doc.value)
}

These two pieces are put together in app.js, which looks like this:

var couchapp = require('couchapp')
var cellmembership = require('./lib/cellmembership.json')
var mapfun = require('./lib/dump_membership')

var ddoc = {
    _id: '_design/calvad',
    rewrites: [{
      from: '',
      to: 'index.html',
      method: 'GET',
      query: {}
    },{
      from: '/*',
      to: '/*'
    }],
    views: {
        "lib":{
            "cellmembership":"exports.lookup="+JSON.stringify(cellmembership)
        },
        "test":{
            "map":mapfun
        }
    },
    lists: {},
    shows: {}
};


module.exports = ddoc;

So instead of "exports.bar=42;", I put in "exports.lookup="+JSON.stringify(...). The key insight that the simple example didn’t really convey is that you want your entire "library" module to be a string. So in this case that means saving my JSON lookup document as a string using JSON.stringify. I probably could have just loaded it directly using fs.readFile(), but I like this way, because it soothes my worries about malformed JSON. If the JSON is screwed up, the app.js won’t run, and the failure happens right away, not in the midst of cranking through hundreds of millions of documents.
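
The reason the library has to be a string is easier to see by imitating what a CommonJS loader does with it. The sketch below is only an approximation for illustration, not CouchDB's actual view server code; `loadModuleFromString` is a hypothetical helper, and the one-entry table stands in for the real lookup file.

```javascript
// Tiny stand-in for lib/cellmembership.json; the real table has thousands of entries.
const cellmembership = {
  '100_223': { airbasin: 'NORTH COAST', county: 'MENDOCINO' }
};

// The library module is stored in the design doc as source text.
const libSource = 'exports.lookup=' + JSON.stringify(cellmembership);

// Hypothetical loader: evaluate module source text the way a CommonJS-style
// require() would, handing it `module` and `exports` objects to fill in.
function loadModuleFromString(source) {
  const module = { exports: {} };
  const run = new Function('module', 'exports', source);
  run(module, module.exports);
  return module.exports;
}

const lookup = loadModuleFromString(libSource).lookup;
console.log(lookup['100_223'].county); // MENDOCINO
```

Because the source text is evaluated later, anything the module needs has to be spelled out inside that string, which is what the JSON.stringify call accomplishes.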

The other bit that I didn’t get from the example was how to include an external function in the design document. What I did was pretty simple, and it worked. I just did "map":mapfun. This is exactly the opposite of what needed to be done with the views:lib:cellmembership.. construct. There the exports.lookup= statement needs to be a string inside of the JavaScript, whereas the assignment of the map function needs to be actual JavaScript code, not the string representation of that code.

This is exactly the kind of inconsistency that drives me nuts and that nobody ever thinks to document, because only crazies like me run into those edge cases.

Dream big

Robert Longo was a hot artist the year I graduated from college, with
a show called something like “Dream Jumbo: Working the Absolute” that
included an art exhibit at LACMA and a show at UCLA. We bought
tickets and went and it was great. We copied the idea of jumping
people, not painting them quite so large, but capturing the movements
and shadows nonetheless.

A year later I was in Europe, doing the backpack Eurail thing. I had
worked for a year and saved up a little money, enough to buy a used
Minolta. Once I got into the groove of traveling, life pretty much
revolved around looking for Romanesque churches, finding cheap hotels,
and strategically choosing night trains between cities.

I went to Europe with many rolls of film, some negative, some black
and white, but mostly slides. I shot all of it, and eventually had to
buy more. To guard against disaster, I would occasionally spot a deal
at a shop and would develop a batch of exposed rolls.

My past self is envious of my current self, with digital cameras not
needing the bag full of film canisters. Then I shot and shared my
images with close friends and family; now I can shoot and post to the
internet to theoretically share with everybody. I can “develop”
pictures on my laptop, and even shoot movies with my camera.

My current self is envious of my past self, with no responsibilities
except to myself, able to go wherever and do whatever. I took
pictures, went to museums, and looked at old architecture. I played
harmonica in between cars on night trains. I watched my bank account
drain down, and got a cash advance on my credit card.

I haven’t heard anything about Robert Longo in years. He may still be
doing stuff, but I don’t care, and he’s certainly not as hot as he
once was. I take a lot more photographs now, but I don’t draw nearly
as much and I haven’t aspired to be an artist in years.