Thinking about documenting everything

A new project we are trying to get going here at the newly re-christened California Traffic Management Labs is an information portal that documents everything related to transportation modeling. While this is really great work for us academics, it presents some interesting problems. On the one hand, it is simple. All we have to do is write down everything we collectively know about all the relevant papers, research, programs, and practice, and then put hyperlinks on everything, and we’re done.

I did this sort of thing once before. A while ago I developed a system I called Academic. Academic was a bunch of perl and mysql running on top of slashcode…yes the same code running slashdot, circa 2003. It died because a hard drive crashed, slashcode is a nightmare, my database schema was too inflexible, and I found a leech was logging into the system and poaching my work (I naively expected people would contribute as much as they took away, but there is no GPL for research synthesis). In short, I never once considered actually unzipping that code, but having done a project once you have a feel for how to do it better the second time around.

So the details of how to load up academic references and how to physically link files and topics and tags are something I know how to do. (I actually have a new version running on top of CouchDB.) The central problem is not the web site design or the database backend, but how to organize knowledge. We’re making a library of sorts, but by removing the physical limits of bookshelves and rooms, we can conceivably have everything just a click away from everything else. But too many things to click on a page is about as enlightening as showing a blank page. We need the system to enforce some discipline on the information content so that relationships actually mean something.

I am taking the approach of deferring as many choices as possible, and striving to allow those who populate the site (me, my colleagues, some hand-picked grad students, anybody who grabs the code once I push it out to github, …) to make the important choices about what relates to what. But I want to avoid the problem that seems to crop up in “everybody can edit” wikis: lots of similarly named and yet different pages. To that end, I’ve decided to organize the information around the idea of topics. A topic is more than a simple definition (although every topic has a simple definition). A topic is closer to a white paper or a survey paper. When one calls up the “Transportation Modeling” topic, the site should display a reasonably complete paper describing what transportation modeling is, including its history, current trends in the practice, active areas of research, and so on.

But the opportunity for disaster is right below the surface. In my current system, I am creating topics and forming generic links between topics. This is nothing nearly as sophisticated as a topic map, but maybe it needs to be. My idea is to use the relationships between topics to pull up material that might be related to the topic in question. The problem is that if a topic gets too broad, then the list of relevant resources will become unusable. For example, because the site is intended to document transportation modeling issues, the topic called “Transportation Modeling” should relate to pretty much everything else in the site. The information content of such a page is about equal to white noise.
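To make the broad-topic problem concrete, here is a minimal sketch in Python (not the CouchDB app itself). The `TopicGraph` class, its method names, and the resource titles are all made up for illustration, and it assumes that "related material" simply means the resources attached to a topic plus those attached to its directly linked topics.

```python
from collections import defaultdict


class TopicGraph:
    """Hypothetical model: topics joined by generic, undirected links."""

    def __init__(self):
        self.links = defaultdict(set)      # topic -> directly linked topics
        self.resources = defaultdict(set)  # topic -> resources attached to it

    def link(self, a, b):
        # A generic link carries no type or direction, only "these relate".
        self.links[a].add(b)
        self.links[b].add(a)

    def attach(self, topic, resource):
        self.resources[topic].add(resource)

    def related(self, topic):
        # Assumption: pull the topic's own resources plus its neighbors'.
        found = set(self.resources.get(topic, ()))
        for neighbor in self.links.get(topic, ()):
            found |= self.resources.get(neighbor, set())
        return found


graph = TopicGraph()
for sub in ("pricing models", "microsimulation models"):
    graph.link("Transportation Modeling", sub)
graph.attach("pricing models", "toll study")                   # made-up resource
graph.attach("microsimulation models", "car-following paper")  # made-up resource

broad = graph.related("Transportation Modeling")  # everything on the site
narrow = graph.related("pricing models")          # only the pricing material
```

With only two subtopics the broad topic already returns every resource in the graph, while the narrow one stays focused. Under this lookup rule, a topic linked to everything carries no more information than a full listing of the site.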

The simple solution is to just make subtopics of big topics. We might have a section on “pricing models”, and another on “microsimulation models”, and so on. But the danger is that at some point down the road somebody is going to create a topic called “microsimulation of road pricing”, and link that topic to “microsimulation” and to “pricing”, and boom, the new topic is white noise, and even pricing and microsimulation are now less useful because they will start pulling in resources from the new hybrid topic(s).

We need to be clever about that eventual use case, so that new topics that relate to many other topics might start out with a useless jumble of possibly related content, but the jumble can be quickly sorted into relevant and irrelevant material. In fact, most of my thinking about this project revolves around making this case easy to handle.
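One possible shape for that sorting step, sketched in Python under my own assumptions: each topic keeps a record of which candidate resources an editor has marked relevant or irrelevant, and anything unmarked stays in the jumble. The `TopicTriage` class, its methods, and the resource titles are hypothetical, and this is only one of several ways the problem could be handled.

```python
class TopicTriage:
    """Hypothetical per-topic record of editor decisions about candidates."""

    def __init__(self):
        self.relevant = set()
        self.irrelevant = set()

    def mark(self, resource, is_relevant):
        # The latest decision wins, so an editor can change their mind.
        keep, drop = (
            (self.relevant, self.irrelevant)
            if is_relevant
            else (self.irrelevant, self.relevant)
        )
        keep.add(resource)
        drop.discard(resource)

    def unsorted(self, candidates):
        # The remaining jumble: candidates nobody has judged yet.
        return sorted(set(candidates) - self.relevant - self.irrelevant)

    def display(self, candidates):
        # Show confirmed material first, then the jumble; hide rejected items.
        confirmed = sorted(set(candidates) & self.relevant)
        return confirmed + self.unsorted(candidates)


# Candidates pulled in for a hybrid topic such as
# "microsimulation of road pricing" (titles are made up).
candidates = ["toll study", "car-following paper", "congestion charge simulation"]

triage = TopicTriage()
triage.mark("congestion charge simulation", True)
triage.mark("car-following paper", False)

shown = triage.display(candidates)
todo = triage.unsorted(candidates)
```

The point of keeping the decisions per topic is that rejecting a resource on the hybrid topic does not touch “pricing” or “microsimulation”, where the same resource may belong.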

3 thoughts on “Thinking about documenting everything”

  1. Hi there – I am working on a similarly academic project: http://philpapers.org/ – it is a site for philosophers, and in the next few months we’ll be working on publishing the source code. We have a rigid tree-like taxonomy instead of your ‘topic maps’ – so I don’t think this would help you with the particular problem you described here – but I believe there are tons of other, less core techniques that we could share. It would be great if we could collaborate on this – and in particular on making CPAN modules out of the interesting bits of code.

  2. Zbigniew, http://philpapers.org/ looks pretty good! And it seems to be rather well populated with content and users. Awesome work.

    At the moment (and hopefully going forward) my app is a CouchDB app. I’m using perl to listen to the changes feed and perform server side actions (only barely started with coding and testing…nothing production quality yet). I was tracking DB::CouchDB::Schema and keeping it current on my fork on github with the latest CouchDB, but I couldn’t keep up and have recently been migrating my code over to AnyEvent::CouchDB.

  3. I am just a contractor here – David Bourget is the original author of the site – so these compliments should go to him :)

    The site is built with Mason, Rose::DB and MySQL, so I guess it might be too different from what you are using for direct reuse of the core parts. Still, there are lots of things that I think you could borrow from our code. These start with the more decoupled tasks, like parsing author names, that we are planning to publish as independent CPAN libraries. Then there are things like interpreting journal RSS feeds and OAI sources, which are coupled to the structure of the article object that we build out of them, although it is quite possible that they can be made more independent. Finally, there are parts of the code that are really coupled with the data structures here but could still be good guides on how to write a similar solution, like the stuff for automatic categorizing of articles. All of that will be published in the upcoming months, and we would really appreciate feedback from people who wish to use it.
