Musing on summer tarts and cobblers

Two weeks ago I made a blueberry and nectarine cobbler, more or less sticking to the recipe from Thomas Keller’s book Ad Hoc at Home. My only variation was that I added nectarines too, not just blueberries. It was terrible; in my opinion the worst fruit cobbler I’ve ever made. The “cobbler” part became a gross, soggy layer of cake-like stuff on top of a too-thin layer of fruit. On the one hand, perhaps my pan was too big and the fruit spread out too much, but on the other hand, if the pan was too big, why did the topping (which was supposed to come out like individual dumplings) glob together into a single surface? Sucky recipe, bad quality control on the cookbook authors’ part, thereby reinforcing my dislike of celebrity chefs and their vanity cookbook projects.

Anyway, that disaster got me thinking about making another blueberry and nectarine cobbler. While I usually go for b&n pie with a proper crust, the time constraints of yesterday’s dinner party precluded putting in the time to make the crust. And Brooke wanted a cobbler.

So I started thinking what would make a good cobbler topping, and I remembered the success I had a long time ago making the caramel topping on pecan rolls. The basic idea is to press half a stick of butter into a cake pan, then layer on a cup or so of brown sugar. As the pecan rolls bake in the oven, the butter and sugar turn into caramel and infuse the pecan rolls with sticky goodness.

So I raided the fridge for some butter and discovered (horrors) that all I had left was a little blob of unsalted butter. But I also spied some clarified butter in a little container. Good enough, so I mixed the two and pressed them into the bottom of my cake pan. Being the good cook that I am, I licked the butter off my fingers—and discovered that the clarified butter wasn’t clarified butter, but rather left over butter-sage sauce!

It’s a funny thing but I am actually a pretty good taster of food (although I am not a very good taster of wine) (or else maybe I just drink a lot of swill) (but I digress). As I tasted the butter, I definitely tasted the sage, and I decided I was okay with that, but I could also taste a hint of garlic, and I was not okay with that. Since I had just crushed and chopped garlic for the sizzling shrimp I was going to make, I really had to think about whether it was my tongue tasting the garlic or my nose smelling it, and that gave me time to think about how the sage would work with the fruit.

I decided the garlic really was in the butter, and it had to go (actually I just added it to the oil I was going to use for the shrimp), and I grabbed a fresh stick of, sadly, salted butter. But I also decided that I really wanted the sage, so I trucked out to the garden to grab some sage leaves. My sage plant from several years got uprooted and didn’t survive this spring’s planting, so all I have is a variegated sage plant with lots of very small leaves. Still good, but I wanted the visual of the leaves, not just the flavor. Then I saw the lemon verbena plant we have growing next to the sage that we intended to use for tea but instead just let it grow. I remember Emma made some fruit dessert once—poached peaches I think—with lemon verbena in the sugar syrup, so I grabbed about 10 nice looking leaves along with the sage.

After washing all the leaves, I placed them in a sunburst pattern on top of the brown sugar I had pressed into the thick layer of butter. Then I added about a quarter of Julia Child’s apple crumble recipe crumble on top of the leaves so that I couldn’t see them any more, and then I tumbled alternating layers of nectarines and blueberries on top of that. Finally, when the fruit was about to the top of the cake pan, I topped it with the rest of the crumble topping (one cup oats, half cup flour, 6 tbsp butter, pinch of salt, 3/4 cup brown sugar, bzapper in the cuisinart to mix) and pressed it down firmly to make a solid layer of sugar-butter-oats.

My idea was to bake it for about an hour at 350 until I could see the caramel bubbling up the sides, and until I could see the fruit begin to bubble through the topping. Then I was going to flip the whole mess onto a big plate, so that the caramel and leaves ended up on top, and the crumble ended up on the bottom like a tart crust.

The results were visually disastrous, but the flavors were great. The few sage leaves really spiked the sugars and flavors of the fruit, and the lemon verbena added a hint of “mystery flavor” that is always fun in a dessert. The crumble crust didn’t add much for me, however, and I don’t think I’ll do that quite like that again.

Unfortunately, I used a completely wrong pan for the cake pan. I actually used a removable bottom pan, which was pretty stupid because a lot of the caramel seeped out onto the baking sheet (I’m not that stupid) rather than bubbling up the sides. And after I flipped the whole thing onto the serving platter, I realized this was just like a tarte tatin, and I could have made it in a cast iron skillet with a pie crust bottom.

So I’m going to make this again, but this time:

  1. use a cast iron pan
  2. maybe put the lemon verbena and sage leaves down first, then the butter, then the sugar, so that the leave show
  3. perhaps a graham cracker crust on top, so it holds together a bit more than the crumble, and gives a bit more crunch
  4. or else perhaps a puff pastry topping that becomes the bottoming, because how cool is it to have crispy puff pastry at the bottom of a oozy drippy fruit tart?

The best part about this dessert was its reception. I had a small serving and really liked the flavor, which is rare for me (I usually just eat my cooking rather than enjoy the flavors). After the first round there was about half the dessert still left on the plate. I mentioned that it looked like we hadn’t really made a dent in the dessert, and suddenly all the adults said they’d like more. In this day and age of low carbs and healthy eating, that’s a resounding success. Finally, when we were cleaning up, there was a very small serving left. I said—hah, we almost finished it!, whereupon Marc asked for a fork and finished it off right from the serving platter. A dessert that is all gone the night it was served is the best kind of dessert, in my opinion.

But while the flavors were great, there is room for improvement, and I have inspiration for more tarts and crumbles.

Using npm with R is great

A few weeks ago I wrote up how I am using npm, the standard package manager for node.js. When I first started using node.js, and when npm first started cropping up as the best package manager to use, I was annoyed by the idea of copying libraries over and over again into each package’s node_modules directory. It seemed wasteful, since they were mostly copies, so I would generally use the -g flag and install globally.

Then I ran into trouble with versions, and I decided my insistence on using -g was stupid. Better to have the required version locally installed than to fight with multiple versions of a package at the global level.

The point is that today, in R, I need to depend on readr but the github version, not the CRAN version, because I need to match a column of times that use “AM/PM” time. In R, there isn’t a clean way to load conflicting versions of a package that I am aware of. I don’t want my programs to use the bleeding edge of readr, but I am willing to accept the devel version for this package.

Unfortunately, I’m the only person using npm to load R packages local to my project. Phooey. But I can hack my R launching script to use devtools to load the package I need locally as follows.

First, I have a standard incantation to make my runtime R find my local, node_modules-installed R libraries:

## need node_modules directories
dot_is <- getwd()
node_paths <- dir(dot_is,pattern='.Rlibs',
                  full.names=TRUE,recursive=TRUE,
                  ignore.case=TRUE,include.dirs=TRUE,
                  all.files = TRUE)
path <- normalizePath(node_paths, winslash = "/", mustWork = FALSE)
lib_paths <- .libPaths()
.libPaths(c(path, lib_paths))

This bit of code will dive down into the local node_modules directory, recursively find all of the .Rlibs directories, and prepend them to the runtime .libPaths, so that local libraries take precedence over global ones.

All I have to do is to insert a command to load the required devel-level packages before installing and testing my code. Something like:

## need node_modules directories
dot_is <- getwd()
node_paths <- dir(dot_is,pattern='.Rlibs',
                  full.names=TRUE,recursive=TRUE,
                  ignore.case=TRUE,include.dirs=TRUE,
                  all.files = TRUE)
path <- normalizePath('node_modules/.Rlibs', winslash = "/", mustWork = FALSE)
if(!file.exists('path')){
    dir.create(path)
}
.libPaths(c(path,node_paths, lib_paths))
vc <-  list(op=">=",version=package_version("0.1.1.9000"))
if(!requireNamespace(package='readr',versionCheck=vc)){
    devtools::install_github('hadley/readr')
}

I can save that as Requirements.R, and then add the following to my package.json file:

...
  "scripts": {
      "test": "/usr/bin/Rscript Rtest.R",
      "preinstall": "/usr/bin/Rscript Requirements.R",
      "install":"/usr/bin/Rscript Rinstall.R"
  },
...

That works and is cool, but extremely one-off. Better would be to add dependencies in the package.json and get them loaded automatically. My unfinished start at this is to create an entry “rDependencies” in the package.json, which npm will then expose to my script in the system environment as “npm_package_rDependencies_…”. But I have to move on and so this is unfinished as of yet:

package.json

...
  "dependencies": {
      "calvad_rscripts": "jmarca/calvad_rscripts",
      "rcouchutils":"git://github.com/jmarca/rstats_couch_utils.git",
      "configr":"git://github.com/jmarca/configr.git"
  },
  "devDependencies": {
    "should": "^6.0.1"
  },
  "rDependencies":{
      "readr":"0.1.1.9000"
  }
  "scripts": {
      "test": "/usr/bin/Rscript Rtest.R",
      "preinstall": "/usr/bin/Rscript Requirements.R",
      "install":"/usr/bin/Rscript Rinstall.R"
  },
...

script snippet to read package.json dependencies

## ideally I would plumb versions from package.json environment variables?

envrr <- Sys.getenv()
print(envrr)
dependencies <- grep(pattern='npm_package_rDependencies'
                    ,x=names(envrr),perl=TRUE,value=TRUE)
print(dependencies)
pkgs <- strsplit(x=dependencies,split='npm_package_rDependencies_')
print(pkgs)
for(i in 1:length(dependencies)){
    pkg <- pkgs[[i]][2]
    ver <- envrr[[dependencies[i]]]
    vc <-  list(op=">=",version=package_version(ver))
    print(vc)
    if(!requireNamespace(package=pkg,versionCheck=vc)){
        print('need to download')
        devtools::install_github(paste('hadley',pkg,sep='/'))
        ## whoops, need to add proper github user, repo name here
    }else{
        print(paste('got',pkg,ver,'already'))
    }
}

Really I need to specify the required development R package like:

  "rDependencies":{
      "readr":{
          "repo":"hadley/readr",
          "version":"0.1.1.9000"
      }
  },

But the hacking gets uglier and uglier because this is passed to the script as npm_package_rDependencies_readr_repo and npm_package_rDependencies_readr_version
which means my braindead regexpr and split calls will need to be tweaked and patched some more to combine the repo and the version with the package.

So, future me, you have work to do and another blog post when you get this cleaned up.

Modernizing my approach to my R packages

I’ve been using R since 2000 or so, probably earlier, off and on. I’ve always just hacked out big old spaghetti-code programs. More recently, as alluded to with this past post, I’ve been migrating to using node.js to call into R. The initial impetus was to solve a problem with memory leakage, and with a single crash disrupting a really big sequence of jobs. By setting up my big outer loops in node.js, I can now fire off as many simultaneous R jobs as my RAM can handle, and if any die, node.js can handle the errors appropriately.

The one remaining issue is that my R code was still pretty brutish. I dislike the formal R packaging stuff, and I wanted something more lightweight, more like what node.js uses. I first tried to use component, but that was the wrong choice for a number of reasons. Way back in October I toyed with the idea of using npm to package up my R code, but I didn’t really start to do that in earnest until very recently. It turns out, with just a few quirks, this works pretty well. This post outlines my general approach to using npm to load R packages.

Continue reading

Another note to my future self on DBIx::Class

I’ve been writing a lot of javascript, and I really like node.js. I like lots of languages, but I find that node.js tends to work how I expect.

That said, sometimes I need to use perl. Last week, after some searching and testing out libraries, I was generally dissatisfied with the node.js packages available for parsing spreadsheets. The node.js way is to be non-blocking and streaming, but I couldn’t find a package that handled old and new spreadsheets that was either non-blocking or streaming (or both). Faced with that, I’d much rather use the tried, true, and extremely well tested Spreadsheet::Read perl module. It is also blocking, but at least it is pretty much guaranteed to work.

So using perl to parse a spreadsheet means I also had to dust off my database code to put the parsed results into my database. Since my last round of perl programming, I’ve gotten much more diligent about testing things as I hack, and writing much smaller modules. So I’m writing a small module to save a list of data to the database. Pretty simple with DBIx::Class.

Creating a test database from Perl

One wrinkle came in testing my code. What I normally do in node.js (with mocha) is to write a little “before” script that creates a database, and then a little “after” script that tears it down. Then all the testing code can write and delete without worrying about bombing the production db, and without requiring me to manually create and delete databases.

The missing link for me (and the purpose of this blog post) was how to create a database and slot in tables from perl and DBIx::Class.

My final solution is a hack of sorts. Instead of being creative, I just dropped down to DBD::Pg and issued a “create database” command directly. My code looks like this:

# create a test database

use DBI;

my $host = $ENV{PGHOST} || '127.0.0.1';
my $port = $ENV{PGPORT} || 5432;
my $db = $ENV{PGTESTDATABASE} || 'test_db';
my $user = $ENV{PGTESTUSER} || $ENV{PGUSER} || 'postgres';
my $pass =  '';

my $admindb = $ENV{PGADMINDATABASE} || 'postgres';
my $adminuser = $ENV{PGADMINUSER} || 'postgres';


my $dbh;
eval{
    $dbh = DBI->connect("dbi:Pg:dbname=$admindb", $adminuser);
};
if($@) {
    croak $@;
}
my $create = "create database $db";
if($user ne $adminuser){
    $create .= " with owner $user";
}
eval {
        $dbh->do($create);
};

That works fine, and is mirrored at the end of the test with a similar $dbh->do("drop database $db"); statement. Sadly, I can’t remember how to do before and after type blocks in perl tests. I seem to remember doing them long ago, but the semantics escape me. Like the subjunctive tense in Italian.

Creating test tables using DBIx::Class

Now the next step that tripped me up was populating the few tables I need for the tests. I have a large crufty db, and lazily used an automated script to create my DBIx::Class schema from the existing PostgreSQL tables. But running $schema->deploy() didn’t work because I have views and so on that muck things up. I really only need two tables for my current spreadsheet data save tests, so I only wanted to deploy() those two tables.

The documentation says:

Additionally, the DBIx::Class parser accepts a sources parameter as a
hash ref or an array ref, containing a list of source to deploy. If
present, then only the sources listed will get deployed.

That’s great, but I couldn’t find any examples of exactly what that meant. So I tried a few things, and one thing worked, and so here I am writing a note to my future self (and anyone else who lands on this page).

My database has multiple postgresql schemas, and so my DBIx::Class schema generation script took that into account. That needs its own documentation, but essentially what I did was:

{"schema_class":"Testbed::Spatial::VDS::Schema",

 "connect_info":{
     "dsn":"dbi:Pg:dbname=spatialvds;host=localhost",
     "user":"myuser"
 },
 "loader_options":{
     "dump_directory": "./lib",
     "db_schema": ["public","hsis","wim","newctmlmap","tempseg"],
     "debug":false,
     "moniker_parts":  ["schema", "name"],
     "moniker_part_separator":  "::",
     "naming": {"ALL":"v8", "force_ascii" : true}
 }
}

The super long class name of Testbed::Spatial::VDS::Schema is cruft from the distant past, but not too difficult to work with. The upshot is that my schema definitions are buried in a directory structure rooted at ./lib/Testbed/Spatial/VDS/Schema/Result/[Public,Hsis,Wim,...]. The two packages that I want to deploy for my tests are called Testbed::Spatial::VDS::Schema::Result::Public::WimStatus and Testbed::Spatial::VDS::Schema::Result::Public::WimStatusCodes.

So.

To deploy just these two tables, I first wrote the fully qualified package names as the “sources” option. But that didn’t work. Then I remembered that when using DBIx::Class, usually you just refer to the different classes (representing tables) by everything after the top level package name. So in this case, I could drop the Testbed::Spatial::VDS::Schema part in both names. My final, working bit of code is:

## deploy via DBIx::Class

use Testbed::Spatial::VDS::Schema;

my $schema = Testbed::Spatial::VDS::Schema->connect(
    "dbi:Pg:dbname=$db;host=$host;port=$port",
    $user,
    );

## deploy just the tables I'm going to be accessing during testing

my $deploy_result;
eval{
    $deploy_result =  $schema->deploy(
        { 'sources'=>["Public::WimStatus",
                      "Public::WimStatusCode"]});
};
if($@) {
    carp 'test db deploy failed';
    croak $@;
}

One final note to my future self. I never like passing passwords around in my programs. What I’ve found is that PostgreSQL uses a .pgpass file, documented here. So as long as the username, host, and database name match one of the lines in that file, it will pull out the correct password. Because this file is chmod 0600, it is less likely to get accidentally read by someone else, and also it will never get slurped up into a git repository. Because perl uses the PostgreSQL C libraries, it automatically inherits this behavior. So with Postgres, you should never be putting passwords into command lines or environment variables or source code.

Non-obvious fix to a dzil problem

Last Friday I decided to skip trying to use node.js to parse spreadsheet files and instead stick with my existing perl solution based on Spreadsheet::Read. Because the code was really old, I had no proper tests, so I just started over from scratch. Poking around Modern Perl 2014 I found a note about using dzil to setup packages.

So I followed along with the choose-your-own-adventure style documentation at http://dzil.org/tutorial/start.html and had good success with setting things up. I rewrote my old code using Moose and immutable state and all that great stuff, and wrote pretty thorough test coverage of the various conditions and edge cases I can think of at the moment.

All was going well until I tried to set up the [AutoPrereqs] plugin.
When asked to compute the dependencies automatically, dzil choked on my binary spreadsheet files stored in the ./t/files directory:

ParseStatusSpreadsheeets[master]$ dzil listdeps
Could not decode UTF-8 t/files/07-2009.xls; filename set by GatherDir 
(Dist::Zilla::Plugin::GatherDir line 215); encoded_content added by 
GatherDir (Dist::Zilla::Plugin::GatherDir line 216); error was: utf8 
"xD0" does not map to Unicode at /usr/lib64/perl5/Encode.pm line 176.

I can easily set the [Prereqs] configuration to list my dependencies manually, but I wanted to do it automatically. I couldn’t believe I was the only person to have binary files mucking up the AutoPrereqs plugin, but the documentation was not helpful at all. The only hints given were to use a custom FileFinder (), but no help on what exactly a FileFinder was or how to set one up in the config file.

Eventually I searched for “binary file” in the Github issues, and found this old bug: https://github.com/rjbs/Dist-Zilla/issues/407. The solution I found there is to tell dzil to ignore files as being binary by listing them in the config file. In my case, that fix works out to be:

[Encoding]
encoding = bytes
match    = xls    ; these are all spreadsheet test files

[AutoPrereqs]

(Another thing not mentioned in the dzil docs is that when they say that “match” is a regex, they don’t mean that you should write match = /xls/i because that won’t work! The config file isn’t perl, it is text that gets manipulated by perl.)

Anyway, with that fix to my dist.ini file, the AutoPrereqs plugin works as expected:

ParseStatusSpreadsheeets[master]$ dzil listdeps
Carp
Data::Dumper
DateTime::Format::DateParse
DateTime::Format::Pg
ExtUtils::MakeMaker
Moose
namespace::autoclean
Spreadsheet::Read
strict
Test::More
warnings

Obsessed with cases

Ever since I bought a Lemolo Daypack, I’ve become mildly obsessed with quality bags and cases. Part of the reason was that I was looking for a decent, good looking pair of bike panniers. There isn’t a large market for panniers, so they’re a little bit hard to find. And when I did find them they tended to be made out of waterproof nylon with plastic buckles and clips. As I searched the internet, I slowly found lots of beautiful panniers, and a lot more beautiful bags and wallets. (In the end I bought the Lemolo Daypack, and a Lemolo Toolroll, but I still don’t have a decent pair of panniers.)

So that brings me to this evening’s entertainment. Today I pulled the trigger and ordered a new iphone. So now I need a case for it that doesn’t look like a clunky chunky clippy thing my daughter would have liked when she was 3. Mind you, I haven’t touched any of these cases, so I know nothing at all about whether I really want to plunk down my cash on these, but I thought I’d collect today’s bookmarks and thoughts in one place for posterity.

First off, carryology is a great resource, but as of right now, if you type iphone6 in their search tool you get back zip.

Second off, I haven’t owned a phone since my tiny sony-ericsson phone from 2002 (or thereabouts).

Third, my reason for wanting a case (your reasons will be different, of course) is that I like to toss my things in my backpack (my Lemolo Daypack, of course) or in my pockets with my keys and change and other abusive items. The case will need to prevent casual scrapes, scratches, and dings, and will need to guard against the occasional busted fountain pen. A bonus would be if the case could protect against the occasional drop, but the more common role will be minimizing wear and tear. Pretty much I’m thinking leather, but now that I think more about it, it might also be possible to do this with heavy duty canvas.

After searching for “leather handmade iphone6 case” and clicking through various links and references (again, carryology articles were great), I’ve lumped my choices into two broad categories: a sleeve design, or a bill fold approach. There is a third category of a shell or backing, such as Apple’s own leather case, but I don’t really see the point of that for my purposes. To protect the phone properly from getting dinked in a big pack or in a pocket, a backing has to be a bit bulky. Apple’s good looking leather backing leaves the screen wide open to scratches and dings.

Sleeves

A sleeve design has the advantage of being simple. You slide the phone in to keep it save, and slide it out to use it. If you slide it in upside down, you can get access to the headphone port and listen to music. The problem with a sleeve is that it looks like it can be quite tricky to design the sleeve properly. The phone has to slide in easily, but not so easily that it slips out accidentally. It has to say securely in the pouch, but must be easy to snag with just the tip of a finger. An errant, fat fingered tug is going to send the phone for a tumble to the floor. My guess is that the more expensive products tend to incorporate slight variations in the design to allow for easy access, as well as tight tolerances to make sure the phone fits just right. For example, the Judas has a slight notch in the top edge of the leather, probably to make it easier to grab the phone. the filzstuek has wool felt lining that will make it easier to slide the phone than raw leather. And the makr is a little hard to figure out without a picture with a phone in it, but I suspect that the two leather tabs at the top open up slightly below the top of the phone to allow for easy access.

Some excellent examples of sleeves:

There are many many more examples on Etsy to fit every taste and budget.

Wallets

The wallet or bill fold style are the other approach I’ve been considering, and some of these apparently allow complete access to the phone’s functionality without having to extract the phone from its case.

  • The Grovemade cases look great. The cases are made of leather and wood. The leather wraps around the wood, and apparently even provides a handy stand, as shown in the pic below.
    the grovemade leather cover doubles as a stand

  • Pad & Quill offer two styles of cases that I’m considering. The Luxury Pocket Book is similar to the Grovemade ones, in that they are also made of wood and leather, and the leather wraps around the front of the phone and folds back for easy access to the phone’s active surface. However, unlike the Grovemade, the leather does not appear to create a stand. Still a very good looking case. My only quibble with the design is that fake bookmark.
    the pad&quill case looks like a little notebook

  • Pad & Quill also offer an all leather case called the Bella Fino. The ad copy appears to claim that they are using a nifty sticker tech to keep the phone stuck to the leather case. I’d like to see this and hold it in my hands to figure out what it does and whether it lives up to its claims, but it certainly looks like it makes for a low profile case.
    the pad&quill bella fino case

The deal breaker

When I was in Japan last spring, I had some time to kill in a department store while my daughters and wife inspected the luxe toilets. We were in the Ginza area of Tokyo, and this department store was holding a leather artisan event. I wandered around, and was really impressed by the beautiful wallets and card holders. I stood a bit too long in front of one gentleman’s stand, and he broke out some English that was better than my phrasebook Japanese. I really liked his card holder, but then my brain snapped on properly and I realized that 22,000円 was roughly $220! I politely asked for his card and shuffled away before I blew my travel budget.

When I got home, the cards and other souvenirs got dumped into a pile and forgotten. But a few weeks ago I came across this guy’s card, and went to his website at munekawa.jp. Now I’m stuck, because what I really want is something like this wallet or this one but sized a bit smaller to fit the iphone properly (they are a bit big). But I really don’t have $300 to spend on a wallet.
Can you imagine an iphone hiding in this wallet that looks like an envelope?

So the search continues. I’ve got a few weeks until the phone arrives. Perhaps I’ll find something when we’re wandering around NYC in 2 weeks.

Cast on

Over the weekend I cast on for a cowl using some yarn my sister got me from Germany. Cast on 300 stitches on my new circular 4.5mm needles (also from my sister’s trip), joined in the round, and started knitting in a 1×1 rib. My sister liked a cowl that she saw in this shop, knit on the same needles and with the same yarn, but the instructions she got (jotted down in German) were to cast on 78 or so stitches and then knit up in garter stitch. The problem with that is then at the end I’d have to graft together the beginning and the end, and the fact that I hate back and forth knitting in garter stitch—too boring for words.

My goal is to do a round or two a day, so 300 to 600 stitches per day.

I’m surprised how weak my hands are. Pinkies and ring fingers on both hands are griping about being sore.

A real-world use of PL/Perl

Last week I wrote a node.js program to parse and copy a CSV file into PostgreSQL. The data included several columns of detector data, and then a catch-all column called XML that was supposed to contain the raw read from the detector. The XML column was a big old ASCII escaped blob of text, and I just ignored it and stuffed it into its own table.

Unfortunately, as is always the case with these things, the XML column wasn’t XML at all. Instead, it contained what looked like a Perl object dumped using Data::Dumper. I couldn’t easily rewrite my node.js program to break up that Perl object, and I certainly didn’t want to rewrite my well-tested node.js program in Perl.

Enter PL/Perl.

I’ve never really had a need for PL/Perl. The PostgreSQL documentation page promotes the ability to use Perl’s string-munging facilities. But here I had an even simpler use case. I just want to call out to Perl, eval() the object, then stash the results.

The reason I’m writing this post is that I’ve never quite gotten the hang of how to use stored procedures in PostgreSQL. This is sort of a “note to my future self” in case I forget containing some of the things I figured out.

First, the initial program I wrote looks like this:

CREATE OR REPLACE FUNCTION perl_xml_segment_decoder (TEXT) RETURNS bt_xml_segment AS $$
    use strict;
    my $unescape = sub {
        my $escaped = shift;
        $escaped =~ s/%u([0-9a-f]{4})/chr(hex($1))/eig;
        $escaped =~ s/%([0-9a-f]{2})/chr(hex($1))/eig;
        return $escaped;
    }; # borrowed from  URI::Escape::JavaScript 

    my $chars = $unescape->( $_[0] );
    my $VAR1;
    eval($chars);

    # clean up some entries we are not using
    my $segment = $VAR1->{'segment'};
    $segment->{'ts'} = $segment->{'Timestamp'};
    my %bar = map { lc $_ => $segment->{$_} } qw{
      SegmentID
      FromLocationID
      ToLocationID
      Route
      GroupBy
      ProjectID
      ts
      NumTrips
      Speed
      Distance
      EstimatedTimeTaken
      TravelTime
    };
    return \%bar;
$$ LANGUAGE plperl;

This takes in one of the “XML” strings, and returns a column type bt_xml_segment that is defined by:

CREATE TABLE bt_xml_segment (
  segmentid      integer primary key,
  fromlocationid integer REFERENCES bt_xml_location (locationid),
  tolocationid   integer REFERENCES bt_xml_location (locationid),
  route          varchar(128),
  groupby        integer,
  projectid      integer REFERENCES bt_xml_project (projectid),
  ts    timestamp with time zone not null,
  numtrips       integer,
  speed          numeric,
  distance           numeric,
  estimatedtimetaken numeric,
  traveltime         numeric
);

One thing I’ve never gotten the hang of is how to call functions. Following the docs, I can call this function as follows:

select * from  perl_xml_segment_decoder('%24VAR1%20%3D%20%7B%0A%20%20%27location%27%20%3D%3E%20%7B%0A%20%20%20%20%27Active%27%20%3D%3E%201%2C%0A%20%20%20%20%27LastCheckin%27%20%3D ... %20%20%27TravelTime%27%20%3D%3E%20%27356.285714285714%27%0A%20%20%7D%0A%7D%3B%0A');

and I would get back a lovely tabular output like this:

 segmentid | fromlocationid | tolocationid | route | groupby | projectid |           ts           |  numtrips |      speed       | distance | estimatedtimetaken |    traveltime    
-----------+----------------+--------------+-------+---------+-----------+------------------------+----------+------------------+----------+--------------------+------------------
      4558 |           3481 |         3472 | SR-39 |      15 |       672 | 2014-07-15 17:30:00-07 |       14 | 8.04274565301844 |      0.8 |                 86 | 356.285714285714
(1 row)

But the semantics of that call are strange to me. What the query says is to treat the function like it is a table. This is reasonable, but what I want to do is call the function on each row of another table, like so:

select perl_xml_segment_decoder(xml.data) from perlhash as xml;

But that returns an array output:

                                      perl_xml_segment_decoder                                      
----------------------------------------------------------------------------------------------------
 (4558,3481,3472,SR-39,15,672,"2014-07-15 17:30:00-07",14,8.04274565301844,0.8,86,356.285714285714)
(1 row)

This is more difficult to use in an INSERT clause. While I could contort that, and make it work, I decided to instead just keep the function as a function, and include the query to the XML data table within the function. Again, the excellent PostgreSQL docs are quite helpful, and explain how to query a table from Perl and then iterate over each returned row. My new function looks like this:

CREATE OR REPLACE FUNCTION perl_xml_segment_obs_decoder () RETURNS setof bt_xml_observation AS $$
    use strict;
    my $unescape = sub {
        my $escaped = shift;
        $escaped =~ s/%u([0-9a-f]{4})/chr(hex($1))/eig;
        $escaped =~ s/%([0-9a-f]{2})/chr(hex($1))/eig;
        return $escaped;
    }; # borrowed from  URI::Escape::JavaScript 

    my $sth = spi_query("SELECT * FROM perlhash");
    while ( defined( my $row = spi_fetchrow($sth) ) ) {
        my $chars = $unescape->( $row->{data} );
        my $VAR1;
        eval($chars);

        # clean up some entries we are not using
        my $segment = $VAR1->{'segment'};
        $segment->{'ts'} = $segment->{'Timestamp'};
        my %bar = map { lc $_ => $segment->{$_} } qw{
          SegmentID
          ts
          NumTrips
          Speed
          Distance
          EstimatedTimeTaken
          TravelTime
        };
        $bar{data_ts}         = $row->{ts};
        $bar{radar_lane_id}   = $row->{radar_lane_id};
        $bar{station_lane_id} = $row->{station_lane_id};
        return_next \%bar;
    }
    return undef;
$$ LANGUAGE plperl;

Because I'm actually following along my git commits, and because I was refactoring things and tuning my relational database tables as I developed, this function returns a different table type from before:

CREATE TABLE bt_xml_observation(
  segmentid      integer not null references bt_xml_segment(segmentid),
  ts    timestamp with time zone not null,
  data_ts timestamp with time zone not null,
  radar_lane_id integer,
  station_lane_id integer,
  numtrips       integer,
  speed          numeric,
  distance           numeric,
  estimatedtimetaken numeric,
  traveltime         numeric,
  primary key(segmentid,ts,data_ts,radar_lane_id,station_lane_id),
  foreign key (data_ts,radar_lane_id,station_lane_id) references smartsig.bluetooth_data(ts,radar_lane_id,station_lane_id)
);

I use this function within an insert statement, as follows:

insert into bt_xml_observation  (select  * from perl_xml_segment_obs_decoder()) ;

In some cases (when populating the segments and location tables, for example), the output of the function includes duplicates. Rather than handle them in the Perl code using a hash or something, I decided to keep the PL/Perl simple and use SQL to remove duplicates. My query for loading up the segments table (the 8 unique segments about which the data was collected) is:

insert into smartsig.bt_xml_segment  (select distinct * from smartsig.perl_xml_segment_decoder()) ;

Finally, I expanded my node.js code to make use of these functions. Each data file (representing an hour of data) was 18MB. My code loads up one file, saves the XML/Perl hash data into a “TEMP” table, and then uses that table to populate the observations. The insert statements use WITH clauses to query the functions, as well as to join those call with the existing data so as to avoid the error of inserting duplicates. Finally, my code is careful to populate the tables in order so that the various foreign key constraints are satisfied. (Note that I like to build my SQL statements as an array that I then “join” together. I do that in whatever language I’m programming in because it makes it easy to slot in dynamic variables, print diagnostic output, etc)

    this.perl_parser=function(client,callback){
        // essentially, I have to do these in order:

        var insert_statements = []
        insert_statements.push([
            "with"
            ,"a as ("
            ,"  select distinct * from perl_xml_project_decoder_from_location()"
            ,"),"
            ,"b as ("
            ,"  select a.*"
            ,"  from a"
            ,"  left outer join bt_xml_project z USING (projectid)"
            ,"  where z.projectid is null"
            ,")"
            ,"insert into bt_xml_project (projectid,title) (select projectid,title from b)"
        ].join(' '))

        insert_statements.push(
            ["with a as ("
             ,"select aa.*,count(*) as cnt from perl_xml_location_decoder_from_location() aa"
             ,"left outer join bt_xml_location z USING(locationid)"
             ,"where z.locationid is null"
             ,"group by aa.locationid,aa.locationname,aa.latitude,aa.longitude,aa.projectid"
             ,"),"
             ,"b as ("
             ,"select locationid,locationname,latitude,longitude,projectid,"
             ,"rank() OVER (PARTITION BY locationid ORDER BY cnt DESC) AS pos"
             ,"from a"
             ,")"
             ,"insert into bt_xml_location (locationid,locationname,latitude,longitude,projectid)"
             ,"(select locationid,locationname,latitude,longitude,projectid"
             ,"from b"
             ,"where pos=1)"].join(' ')
            )
        insert_statements.push([
            "with a as (select distinct aa.* from perl_xml_segment_decoder() aa"
            ,"left outer join bt_xml_segment z USING(segmentid)"
            ,"where z.segmentid is null)"
            ,"insert into bt_xml_segment (segmentid,fromlocationid,tolocationid,route,groupby,projectid)"
            ,"(select segmentid,fromlocationid,tolocationid,route,groupby,projectid from a)"
        ].join(' '))
        insert_statements.push(
            'insert into bt_xml_observation  (select  * from perl_xml_segment_obs_decoder())'
        )


        var q = queue(1);  // using queue (https://github.com/mbostock/queue)
                           // with parallelism of 1 to make sure each task 
                           // executes in order

        insert_statements.forEach(function(statement) {
            q.defer(function(cb){
                client.query(statement
                             ,function (err, result) {
                                 //console.log(statement)
                                 return cb(err)
                             })
            })
            return null
        })
        q.awaitAll(function(error, results) {
            //console.log("all done with insert statements")
            return callback()
        })

    }

And there you have it: a node.js program that runs SQL queries that use Perl code embedded in PL/Perl functions.

The gory details can be found in my github repo for this.

More with the GDAL/OGR perl bindings

So my last post talked about my struggles to finally get something saved in the database using the native perl bindings into the GDAL/OGR library. Once I got that working and pushed out the post, I immediately started loading up multiple files and playing around with the data. One thing I noticed was that it was impossible to separate different “trips” within the data without playing around with space and time. What I wanted was an easy way to flag each batch of points with a field identifying the run.

The auto-generated schema for the GPX data looks like this:

d testogr.track_points
                                              Table "testogr.track_points"
       Column       |           Type           |                               Modifiers                                
--------------------+--------------------------+------------------------------------------------------------------------
 ogc_fid            | integer                  | not null default nextval('testogr.track_points_ogc_fid_seq'::regclass)
 wkb_geometry       | geometry(Point,4326)     | 
 track_fid          | integer                  | 
 track_seg_id       | integer                  | 
 track_seg_point_id | integer                  | 
 ele                | double precision         | 
 time               | timestamp with time zone | 
 magvar             | double precision         | 
 geoidheight        | double precision         | 
 name               | character varying        | 
 cmt                | character varying        | 
 desc               | character varying        | 
 src                | character varying        | 
 link1_href         | character varying        | 
 link1_text         | character varying        | 
 link1_type         | character varying        | 
 link2_href         | character varying        | 
 link2_text         | character varying        | 
 link2_type         | character varying        | 
 sym                | character varying        | 
 type               | character varying        | 
 fix                | character varying        | 
 sat                | integer                  | 
 hdop               | double precision         | 
 vdop               | double precision         | 
 pdop               | double precision         | 
 ageofdgpsdata      | double precision         | 
 dgpsid             | integer                  | 
 speed              | double precision         | 
Indexes:
    "track_points_pkey" PRIMARY KEY, btree (ogc_fid)
    "track_points_wkb_geometry_geom_idx" gist (wkb_geometry)

There are three fields that are completely blank: src, desc, and name. I decided to use src to identify the source of the data as the file name it came from.

First I modified my previous program to parse the command line options using Getopt::Long. I don’t use all of its power in this example, but in the past I’ve been well served by starting with that in case the script grows and mutates.

With Getopt::Long, I understand there are ways to input a list of things into the arguments. You can have multiple invocations of the same option, for example, --file mydata.gpx --file moredata.gpx, or you can input them as a comma separated list and follow the recipe in the perldoc for the module. However, I wanted to use a glob, like –file data/*.gpx, so I instead decided to just stick all the files after a double dash on the command line. So really, in the following code, I’m only using Getopt::Long to parse out a –help command! However, it’s there if I need to expand functionality in the future.

use strict;
use warnings;
use Carp;

use Geo::GDAL;
use Data::Dumper;

use Getopt::Long;
use Pod::Usage;

my $man = 0;
my $help = 0;

my @files;

my $result = GetOptions(
    'help|?' => $help,
    ) or pod2usage(2);

pod2usage(-exitval => 0, -verbose => 2) if $help;

@files = @ARGV;
...

With that, I have all of my input files in an array, and I can loop over them and store the filename in the source field in the db by using $new_feature->SetField('src',$_);, as follows:

foreach (@files){

    my $ds = Geo::OGR::Open($_);

    my $layer         = $ds->Layer($layer_name);
    my $feature_count = $layer->GetFeatureCount();
    carp "$layer_name, $feature_count";
    if ( $feature_count < 10 ) {
        next;
    }

    carp "saving $_ to pg";

    # now append each feature
    my $x = 0;
    $pg_layer->StartTransaction();
    while ( my $feature = $layer->GetNextFeature() ) {

        my $new_feature = Geo::OGR::Feature->new($defn);
        $new_feature->SetFrom($feature);

        # write the filename as the src field, for making lines later
        $new_feature->SetField('src',$_);

        my $pgf = $pg_layer->CreateFeature($new_feature);

        $x += 1;
        if ( $x % 128 == 0 ) {
            carp $x;
            # uncomment the following to crash your program
            # $pg_layer->CommitTransaction();
            # StartTransaction() seems to auto commit prior transaction?
            $pg_layer->StartTransaction(); 
            $x = 0;
        }

    }
    if ($x) {
        carp "all done, $x remaining";
        $pg_layer->CommitTransaction(); # this one doesn't crash for some reason
        carp "last transaction committed";
    }
}

That does its magic, and the database now has distinct groups of points. Now if you want to make “lines” out of those points, you can do this in PostGIS:

SELECT ST_MakeLine(wkb_geometry ORDER BY track_seg_point_id ASC) AS linegeom, src
INTO table testogr.lines
FROM testogr.track_points
GROUP BY src;

Et voila

QGIS rendering the new lines table, on top of OSM lines data

QGIS rendering the new lines table, on top of OSM lines data

Of course, that isn’t at all helpful, as I want to see speeds, not just the lines. Next step is to try to figure out how to add a measure to each point, and then collect those (X,Y,M) type points into a line with a measure dimension. I guess that will be my next post.