A lot of data is a lot of data

I can’t seem to get an efficient setup going for storing loop data in couchdb.  On the surface it seems pretty simple: every loop is independent of every other loop, so every observation can be a document.  But for this application that is more limiting than I first thought.  The problem is that after storing just a few days’ worth of data, the single couchdb database expands to 35GB.  I tried running my carefully crafted map/reduce to get mean and variance stats, and the process went out to lunch for days before I killed it off.
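For reference, here is a minimal sketch of the kind of mean/variance summary that can be merged piecewise, which is what a reduce step needs when it is handed partial results. This is not the map/reduce from the post; it is plain JavaScript with no couchdb calls, the field name `volume` is a hypothetical stand-in for whatever a loop observation actually stores, and the merge rule is the standard pairwise update for count, mean, and sum of squared deviations.

```javascript
// Map step: one observation becomes a one-element summary.
// `doc.volume` is a hypothetical field name, not taken from the post.
function toSummary(doc) {
  return { n: 1, mean: doc.volume, m2: 0 };
}

// Merge two partial summaries without revisiting the raw observations.
// m2 is the running sum of squared deviations from the mean.
function merge(a, b) {
  if (a.n === 0) return b;
  if (b.n === 0) return a;
  const n = a.n + b.n;
  const delta = b.mean - a.mean;
  return {
    n: n,
    mean: a.mean + delta * b.n / n,
    m2: a.m2 + b.m2 + delta * delta * a.n * b.n / n
  };
}

// Fold any list of summaries into one.
function reduceSummaries(summaries) {
  return summaries.reduce(merge, { n: 0, mean: 0, m2: 0 });
}

// Sample variance from a finished summary.
function variance(s) {
  return s.n > 1 ? s.m2 / (s.n - 1) : 0;
}

// Made-up observations, summarized in two halves and then combined.
const docs = [2, 4, 4, 4, 5, 5, 7, 9].map(v => ({ volume: v }));
const firstHalf = reduceSummaries(docs.slice(0, 4).map(toSummary));
const secondHalf = reduceSummaries(docs.slice(4).map(toSummary));
const total = merge(firstHalf, secondHalf);
console.log(total.n, total.mean, variance(total));
```

Because each summary is three numbers regardless of how many observations went into it, the combined result does not depend on how the observations were split up.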


Trevor’s Autonet paper published

Trevor’s Autonet paper finally got published, and we’ve gotten a small bit of press.  Funny how that works.  Do research and build a prototype.  Write a paper or two or four, apparently get no interest.  Project mostly trickles off.  Then one paper finally gets published by a slower journal, and hey, everybody is interested.

While the ideas are good, and while Trevor and his team did a great job with the prototype and got a working system running, I think the real barrier to something like Autonet taking off is the difficulty in getting a local area wireless connection up and running.  Not from a technical, bits/bytes/hand-off/Doppler-shift point of view.  Rather from a non-technical user’s point of view.  It is quite difficult to set up a device so that it both blabs and listens on some open wireless channel without requiring careful attention from the user.  Most wifi links, in contrast, are pretty simple to use because there is a defined server and client.  But even then most dialogs ask the user to select which host to access, and some require some sort of password or access code.

In the intervening years between working on that stuff and where we are now, we’ve sort of come to the conclusion that the data channel isn’t as important as just freeing the information from the automobile.  From the person traveling, really.

The primary advantage of a local area wireless connection is that, well, those cars and devices you can talk to probably have data that are relevant to you too, because you’re all sitting in the same spot.  The local area wireless link acts like a spatial query on the huge mountain of traffic data that is available.  The disadvantage is the need to configure your wireless device in a secure, user-friendly way, and needing to develop some sort of protocol to query distant locations.

On the other hand, a cellular link does not give you an automatic spatial query on the data.  Of course you can *do* a spatial query, but that costs some CPU cycles, whereas with the Autonet idea, you’re *only* querying geographically proximate neighbors.  You’ve also got the problem that the wide area wireless links cost money to use.  Cellphone companies are known to charge outrageous rates for data transfer, and in fact, AT&T specifically forbids using their data connection in the manner in which we would *like* to use it.  To quote from their service agreement terms and conditions:

Prohibited and Permissible Uses: Except as may otherwise be specifically permitted or prohibited for select data plans, data sessions may be conducted only for the following purposes: (i) Internet browsing; (ii) email; and (iii) intranet access. …[T]here are certain uses that cause extreme network capacity issues and interference with the network and are therefore prohibited. Examples of prohibited uses include, without limitation, the following: (i) server devices or host computer applications, including, but not limited to, Web camera posts or broadcasts, automatic data feeds, automated machine-to-machine connections or peer-to-peer (P2P) file sharing; …

So, an app that automatically uploads location and speed and queries traffic conditions every few seconds is out, but an application that “browses the internet” is okay.  So an application that responds to user input to “browse” the internet with a heartbeat ping is probably okay, but making it a daemon that bleeps every few minutes is not.

Gotta get us some iPhones so we can test this stuff out, I guess.  Which means we have to get funding.

possibly inconsistent data

One of the things I am trying to figure out with couchdb is how to structure data so that it can’t be internally inconsistent.  Normalized, I guess, is the word for that.

So suppose I have Caltrans District, County, and City.  All of which are cleanly delimited, etc etc.  In a relational database, I’d enforce consistency by using foreign key constraints, so District 12 links to Orange County, and there can only be one link from a county to a district, etc.  But in couchdb you don’t get foreign keys.  So if I want to include data on the district, etc, I have to shove it into the document.  But that means I can make mistakes, and no one will stop me.

So I can have one document that says:

{
  "City": "Costa Mesa",
  "County": "Orange",
  "District": 12
}

and another that says

{
  "City": "Newport Beach",
  "County": "Orange",
  "District": 7
}

Both are accepted, even though the county of Orange should never be understood to be in District 7.  Putting just the one-level-up reference in each document doesn’t help either, because then I can’t sort on

[District,County,City]
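Since nothing stops the mistake at write time, one option is to look for it after the fact. The sketch below is plain JavaScript run over an in-memory list, not a couchdb view or validation function; the documents are hypothetical ones shaped like the two examples above, and the function names are made up for illustration. It reports every county that shows up under more than one district.

```javascript
// Hypothetical documents, shaped like the two examples above.
const docs = [
  { City: "Costa Mesa", County: "Orange", District: 12 },
  { City: "Newport Beach", County: "Orange", District: 7 },
  { City: "Irvine", County: "Orange", District: 12 }
];

// Collect the set of districts claimed for each county.
function districtsByCounty(docs) {
  const seen = new Map();
  for (const doc of docs) {
    if (!seen.has(doc.County)) seen.set(doc.County, new Set());
    seen.get(doc.County).add(doc.District);
  }
  return seen;
}

// Return the counties that appear under more than one district.
function findConflicts(docs) {
  const conflicts = [];
  for (const [county, districts] of districtsByCounty(docs)) {
    if (districts.size > 1) {
      conflicts.push({
        county: county,
        districts: Array.from(districts).sort((a, b) => a - b)
      });
    }
  }
  return conflicts;
}

const conflicts = findConflicts(docs);
console.log(JSON.stringify(conflicts));
```

This only detects the inconsistency; it does not say which district is the right one, so it is a report to act on rather than a constraint.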

And while I am on the subject of sorting, I can’t yet figure out how to get a numerical sort of districts.  They are called 1, 2, 3, … , 12, but when I sort them on District_id in the view I get “1”, “10”, “11”, etc.  That is alpha sorting, not numeric ordering.  I figure I’ll get that one sorted eventually.  I saw something about how to sort on dates, so I suppose it is a similar hack, or a matter of writing javascript to convert text to numbers in the view function before emitting the key.
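The convert-before-emitting idea might look something like the sketch below. It assumes District_id is stored as text, which is a guess based on the alpha ordering. The `emit` defined here is a stand-in that collects rows so the snippet runs on its own, and the final sort only mimics the order in which numeric keys come back, since couchdb collates numeric keys by value.

```javascript
// Stand-in for the view server's emit(); it just collects rows locally.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: convert the text id to a number before emitting it as the key.
// Storing District_id as text is an assumption, not something the post states.
function map(doc) {
  const district = parseInt(doc.District_id, 10);
  if (!isNaN(district)) {
    emit(district, null);
  }
}

// Feed in ids in the alpha order complained about above.
["1", "10", "11", "12", "2", "3"].forEach(id => map({ District_id: id }));

// Numeric comparison, mimicking how numeric keys are ordered.
rows.sort((a, b) => a.key - b.key);
const keys = rows.map(r => r.key);
console.log(keys);
```

Documents whose District_id does not parse as a number are skipped here, so they would drop out of the view rather than sort into an odd spot.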

couchdb

Stumbled across couchdb thanks to the Sakai devel mailing list.  Looks cool, but I need to use it before I can get my head around what it might be able to do.  I think a good toy application that will also be useful is to code up Mike’s glossary using it.  That would be good because the limitation of a wiki is that when you want to add links to existing pages, you have to guess or whatnot, and anyway a wiki is not a glossary.  I just want a dynamic, editable, cross-linked glossary.  That isn’t hand-generated!