Tedious but necessary

I’ve found that I prefer making things to maintaining things. My wife will testify that tidying up is not my forte, but also that I don’t mind even the most laborious cooking tasks.

I’m not sure where loading up raw data falls on the spectrum between making and cleaning, but unfortunately I’m pretty good at it, so I usually have to do it myself. In some ways it is like cleaning up—carefully pick up this row of data, process it, and place it where it belongs. But it also has aspects of making things, and that’s perhaps why I am good at it.

I’ve recently been massaging HSIS data and raw PeMS data. Both come as CSV files, and for both I’m using Perl to parse each line before storing it. The HSIS data I am stuffing directly into a PostgreSQL database using COPY commands (very fast), while the PeMS raw data I am stuffing into a CouchDB database (slower, but I can set up some very fast queries using CouchDB’s map/reduce views). Needless to say, I’ve found problems with both data sets.

First, the HSIS data. The Highway Safety Information System is a multistate database that contains crash, roadway inventory, and traffic volume data for a select group of states. I requested and was sent 2005, 2006, and 2007 accident data for California for a project we are working on. The HSIS is much better than our prior method of using California’s TASAS data directly. The data are well documented and, aside from the slightly annoying requirement to go through a live individual, easy to get access to.

Looking at the California data dictionary, it is pretty easy to extract a few main tables (for the data I am using, I have acc, veh, and road tables) plus a bunch of tables that provide definitions for the various one-letter codes used in the data. For example, highway group ‘L’ is a ‘Left independent alignment’. I created PostgreSQL tables for all of these, and then added foreign key constraints on the variables in the three main tables. If a value for a variable isn’t in the linked table, PostgreSQL chokes and refuses my COPY command, forcing me to fix the error.
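The foreign-key trick is easy to sketch. Here is a minimal illustration in Python with SQLite standing in for PostgreSQL (my actual loader is Perl, and the schema here is simplified; only the ‘hwygrp’ code comes from the post):

```python
import sqlite3

# Codes live in a small definitions table, and a foreign key on the
# main table rejects any value that is not defined there.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE hwygrp (code TEXT PRIMARY KEY, descr TEXT)")
conn.execute(
    "CREATE TABLE acc (caseno TEXT, hwygrp TEXT REFERENCES hwygrp(code))"
)
conn.execute("INSERT INTO hwygrp VALUES ('L', 'Left independent alignment')")

conn.execute("INSERT INTO acc VALUES ('c1', 'L')")  # known code: accepted
try:
    conn.execute("INSERT INTO acc VALUES ('c2', '?')")  # undefined code
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # the database refuses the bad row
```

The same idea scales up: in PostgreSQL the whole COPY aborts on the first bad value, which is exactly the behavior I want while loading.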

Usually in cases like this, the errors are parsing problems (that is, my mistakes). This time, however, I’m using the excellent Perl module Text::CSV, so I can be confident the CSV is being parsed correctly. What I’ve found is that even with the impressive data dictionary linked above, there are numerous errors. For example, in the accident files I never found an entry ‘<’, even though such values are supposed to occur for ‘acctype’, ‘weather’, ‘loctyp’, and ‘light’. Instead, the accident data file contains ‘-’ symbols. This isn’t a big deal, but what a pain. More of a problem was the occurrence of ‘L’ and ‘R’ values for ‘psmilsuf’. I have no idea what these values might mean. I guessed that maybe they mean the same as for ‘psmilprf’, but maybe not; maybe they should match the ‘L’ and ‘R’ in the ‘hwygrp’ variable.
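The check that surfaces these mismatches is just a set difference: compare the codes observed in each column against the documented set. A quick sketch in Python rather than the Perl I’m actually using (the allowed-code sets below are illustrative stand-ins, not the real HSIS data dictionary):

```python
import csv
import io

# Hypothetical allowed-code sets; the real ones come from the HSIS
# data dictionary. The point is just the set-difference check.
allowed = {
    "weather": {"A", "B", "C", "-", "<"},
    "psmilsuf": {"", "<"},
}

sample = io.StringIO(
    "caseno,weather,psmilsuf\n"
    "1,A,\n"
    "2,-,L\n"   # 'L' is not documented for psmilsuf
    "3,B,R\n"   # neither is 'R'
)

# Collect every value that falls outside the documented set.
unexpected = {col: set() for col in allowed}
for row in csv.DictReader(sample):
    for col, ok in allowed.items():
        if row[col] not in ok:
            unexpected[col].add(row[col])

print(unexpected)  # psmilsuf shows the undocumented 'L' and 'R'
```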

Once I manually fixed all of those inconsistencies, the data loaded up in a jiffy.

My main code loop, for posterity, is as follows:

    # foreign key constraints mean two passes through the data
    # first copy just the case number into the cases table
    my @case_copy = $vdb->storage->dbh_do(
        sub {
            my ( $storage, $dbh, $file ) = @_;
            $dbh->do('COPY hsis.cases (caseno) from STDIN with csv');

            while ( my $line = $csv->getline_hr($file) ) {
                $dbh->pg_putcopydata( $line->{'caseno'} . "\n" );
            }
            $dbh->pg_putcopyend();
        },
        $z,
    );

    # reset the file handle for the next pass
    $z->seek( 0, 0 );

    # again, burn the first line as it is just column labels
    $csv->column_names( @{ $csv->getline($z) } );

    # now load up the acc table
    my @rows = $vdb->storage->dbh_do(
        sub {
            my ( $storage, $dbh, $file ) = @_;
            $dbh->do(
    'COPY hsis.acc (cntyrte,milepost,rodwycls,caseno,psmilprf,psmilsuf,acctype,acc_ts,hour_known,weather1,weather2,loc_typ,sde_hwy,towaway,severity,rdsurf,light,trk_inv,mtcy_inv,distance,numvehs) from STDIN with csv'
            );

            while ( my $line = $csv->getline_hr($file) ) {

                # reassemble the timestamp from the date and time fields
                my ( $hr, $mn, $yr, $mon, $dy );

                if ( $line->{'hour'} =~ /(\d\d)(\d\d)/sxm ) {
                    $hr = $1;
                    $mn = $2;
                }
                if ( $line->{'acc_date'} =~ /(\d\d\d\d)(\d\d)(\d\d)/sxm ) {
                    $yr  = $1;
                    $mon = $2;
                    $dy  = $3;
                }

                # if hour is 25 or 26, change to 00 and mark it unknown
                if ( $hr > 24 ) {
                    $hr                   = 0;
                    $mn                   = 0;
                    $line->{'hour_known'} = 'f';
                }
                else {
                    $line->{'hour_known'} = 't';
                }
                my $dt = DateTime->new(
                    year   => $yr,
                    month  => $mon,
                    day    => $dy,
                    hour   => $hr,
                    minute => $mn,
                );
                $line->{'ts'} = $dt->datetime();

                # reassemble the row as CSV and feed it to COPY
                $csv2->combine( map { $line->{$_} ? $line->{$_} : q{} }
                      qw{ cnty_rte milepost rodwycls caseno psmilprf psmilsuf acctype ts hour_known weather1 weather2 loc_typ sde_hwy towaway severity rdsurf light trk_flg mtcy_flg distance numvehs }
                );
                my $reassembled_line = $csv2->string() . "\n";
                $dbh->pg_putcopydata($reassembled_line);
            }
            $dbh->pg_putcopyend();
        },
        $z,
    );


The $vdb->storage->dbh_do bit is from DBIx::Class::Storage::DBI. It is a very cool way to get at the raw DBI layer while still having access to the high-level DBIx::Class machinery. I found that method while looking up how to get a DBI handle directly in order to use the DBD::Pg support for PostgreSQL COPY.

The PeMS raw data didn’t offer many parsing problems, as each record is just a rather basic list containing the loop detector group id, the timestamp, and then the volume, occupancy, and optionally speed for each lane. My code parses that just fine. Instead I have different problems that are tedious and need careful solutions.

First, the data take up a lot of space. The zipped files are large; the unzipped stored files are huge. I’m using CouchDB, as I said above, mostly to test out the space constraints and compare the querying speed for our uses of the data. A single district (our local D12, covering Orange County, California) for an entire year, after compaction, takes up about 276G. Because I wasn’t sure how much space would be needed, I manually “sharded” the data by saving each month to its own database. So I actually have 12 different CouchDB databases of between 22G and 26G each. Then I built two views (which will probably be combined into one soon) that together take up about 6G per month, or 72G total. Districts 7 and 4 are also very large, but the other Caltrans districts aren’t much of a problem.

Still, figuring out how to handle such a large amount of data is tedious. Keeping it compressed in zip files is best for space, but access is ridiculously slow. Loading it into PostgreSQL is an okay option, but if the data get larger than the disk space on one machine, I will have to figure out how to set up a psql cluster (I know such a thing exists, but I don’t really want to deal with learning how to set it up). In addition to psql and CouchDB, I also tried Tokyo Cabinet/Tokyo Tyrant, but figured out that I didn’t really want a key-value store, and that the documentation of the other database options was lacking. I blogged about my problems with Tokyo Tyrant earlier.
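The manual month-sharding amounts to routing each record to a per-month database by its timestamp. A sketch in Python (the database-name pattern here is my own invention for illustration, not a PeMS or CouchDB convention):

```python
from datetime import datetime

def shard_db_name(district, ts):
    """Map a record's timestamp to its month-shard database name.

    The pems_dNN_YYYY_MM pattern is a hypothetical naming scheme;
    any scheme works as long as it is a pure function of the record.
    """
    return f"pems_d{district:02d}_{ts.year}_{ts.month:02d}"

# Every record for March 2007 in District 12 lands in the same database.
ts = datetime(2007, 3, 15, 8, 30)
print(shard_db_name(12, ts))  # pems_d12_2007_03
```

The nice property is that queries for a date range only have to touch the handful of month databases that overlap it, and each database stays a manageable size for compaction and view building.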

More irritating, though, is my sneaking suspicion that there is a fundamental flaw in how PeMS dumps its data to CSV and makes it available to users. Specifically, here at UCI we have access to raw loop data from D12 (when the stream is up, of course). I’ve used that data extensively, and know for a fact that sometimes one lane will report null while the other lanes report valid counts and occupancies. I even wrote a cool little bit of Java that uses the E-M algorithm to infer the most likely values for the missing data points given a block of data. But looking at the PeMS raw data, I never see any cases where one lane is null while others are non-null. Having just finished writing the programs to parse CSV from PeMS and HSIS, I’m used to weird edge cases, like forgetting to handle quoted entries with embedded commas, or handling null values. I suspect that either every record with a null value was dropped from the output, or that the CSV dumping routine wrote 0 for null values. Either way, the supposed “rawness” of the raw 30-second loop data is called into question when errors that are pretty common in my experience aren’t showing up at all. Perhaps nobody bothered to go through the tedious but necessary job of making sure that the output was correct.
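The sanity check implied here is easy to state: in truly raw data, some records should have one lane null while the other lanes report valid counts. A sketch in Python, with a simplified stand-in for the real 30-second record layout:

```python
# A record here is (detector_id, timestamp, [per-lane volumes]), with
# None marking a missing lane reading. This layout is simplified from
# the actual PeMS 30-second format for illustration.
def has_partial_nulls(lanes):
    """True if some, but not all, lane values are missing."""
    missing = sum(v is None for v in lanes)
    return 0 < missing < len(lanes)

records = [
    ("1201", "2007-03-15 08:30:00", [12, 14, None]),      # one dead lane
    ("1202", "2007-03-15 08:30:00", [10, 11, 13]),        # all lanes fine
    ("1203", "2007-03-15 08:30:00", [None, None, None]),  # detector down
]

suspicious = [r[0] for r in records if has_partial_nulls(r[2])]
print(suspicious)  # ['1201']
```

If a scan of the PeMS dumps never turns up a single record like ‘1201’, while the live D12 stream produces them routinely, that is strong circumstantial evidence that the dump routine is dropping or zero-filling nulls.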
