A new project we are trying to get going here at the newly re-christened California Traffic Management Labs is an information portal that documents everything related to transportation modeling. While this is really great work for us academics, it presents some interesting problems. On the one hand, it is simple. All we have to do is write down everything we collectively know about all the relevant papers, research, programs, and practice, and then put hyperlinks on everything, and we’re done.
I did this sort of thing once before. A while ago I developed a system I called Academic. Academic was a bunch of Perl and MySQL running on top of Slashcode… yes, the same code running Slashdot, circa 2003. It died because a hard drive crashed, Slashcode is a nightmare, my database schema was too inflexible, and I found a leech was logging into the system and poaching my work (I naively expected people would contribute as much as they took away, but there is no GPL for research synthesis). In short, I never once considered actually unzipping that code, but having done a project once, you have a feel for how to do it better the second time around.
So the details of how to load up academic references and how to physically link files and topics and tags are something I know how to handle. (I actually have a new version running on top of CouchDB.) The central problem is not the web site design or the database backend, but how to organize knowledge. We’re making a library of sorts, but by removing the physical limits of bookshelves and rooms, we can conceivably have everything just a click away from everything else. But too many things to click on a page is about as enlightening as showing a blank page. We need the system to enforce some discipline on the information content so that relationships actually mean something.
I am taking the approach of deferring as many choices as possible, and striving to allow those who populate the site (me, my colleagues, some hand-picked grad students, anybody who grabs the code once I push it out to GitHub, …) to make the important choices about what relates to what. But I want to avoid the problem that seems to crop up in “everybody can edit” wikis of lots of similarly named and yet different pages. To that end, I’ve decided to organize the information around the idea of topics. A topic is more than a simple definition (although every topic has a simple definition). A topic is closer to a white paper or a survey paper. When one calls up the “Transportation Modeling” topic, the site should display a reasonably complete paper describing what transportation modeling is, including its history, current trends in the practice, active areas of research, and so on.
But the opportunity for disaster is right below the surface. In my current system, I am creating topics and forming generic links between topics. This is nothing nearly as sophisticated as a topic map, but maybe it needs to be. My idea is to use the relationships between topics to pull up material that might be related to the topic in question. The problem is that if a topic gets too broad, then the list of relevant resources will become unusable. For example, because the site is intended to document transportation modeling issues, the topic called “Transportation Modeling” should relate to pretty much everything else in the site. The information content of such a page is about equal to white noise.
The simple solution is to just make subtopics of big topics. We might have a section on “pricing models”, and another on “microsimulation models”, and so on. But the danger is that at some point down the road somebody is going to create a topic called “microsimulation of road pricing”, and link that topic to “microsimulation” and to “pricing”, and boom, the new topic is white noise, and even pricing and microsimulation are now less useful because they will start pulling in resources from the new hybrid topic(s).
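To make the failure mode concrete, here is a minimal Python sketch of generic topic links. It is not the code behind the site: the class name `TopicGraph`, its methods, and the resource titles are all hypothetical, and it assumes the simplest possible rule, namely that a topic's page lists its own resources plus those of every directly linked topic.

```python
from collections import defaultdict


class TopicGraph:
    """Hypothetical model: topics joined by generic, untyped links."""

    def __init__(self):
        self.links = defaultdict(set)      # topic -> directly linked topics
        self.resources = defaultdict(set)  # topic -> resources filed under it

    def link(self, a, b):
        # Generic links carry no direction and no meaning beyond "related".
        self.links[a].add(b)
        self.links[b].add(a)

    def add_resource(self, topic, resource):
        self.resources[topic].add(resource)

    def related_resources(self, topic):
        # Assumed display rule: own resources plus those of each neighbor.
        found = set(self.resources[topic])
        for neighbor in self.links[topic]:
            found |= self.resources[neighbor]
        return found


graph = TopicGraph()
# Made-up resource titles, for illustration only.
graph.add_resource("pricing models", "paper on congestion tolls")
graph.add_resource("microsimulation models", "paper on car-following")
graph.add_resource("microsimulation of road pricing", "paper on simulated toll lanes")

before = graph.related_resources("pricing models")

# Somebody creates the hybrid topic and links it to both parents.
graph.link("microsimulation of road pricing", "pricing models")
graph.link("microsimulation of road pricing", "microsimulation models")

after = graph.related_resources("pricing models")
hybrid = graph.related_resources("microsimulation of road pricing")
```

Under that assumed rule, `before` holds only the pricing paper, `after` has also picked up the hybrid topic's paper, and `hybrid` lists every resource in the graph. With three resources that is harmless; with thousands it is the white noise described above.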
We need to be clever about that eventual use case, so that new topics that relate to many other topics might start out with a useless jumble of possibly related content, but the jumble can be quickly sorted into relevant and irrelevant material. In fact, most of my thinking about this project revolves around making this case easy to handle.
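One way the sorting step might look, purely as a sketch: keep a per-topic verdict for each candidate resource, show confirmed items first, leave unsorted ones visible, and hide anything marked irrelevant. The class `TopicTriage`, its methods, and the resource titles are hypothetical names chosen for illustration; nothing here is settled design.

```python
class TopicTriage:
    """Hypothetical per-topic sorter for candidate resources."""

    def __init__(self, candidates):
        self.candidates = list(candidates)  # the initial jumble, in arrival order
        self.verdicts = {}                  # resource -> True (relevant) or False

    def mark(self, resource, relevant):
        # An editor records a verdict for one candidate on this topic only.
        if resource not in self.candidates:
            raise ValueError(f"unknown resource: {resource!r}")
        self.verdicts[resource] = bool(relevant)

    def display_order(self):
        # Confirmed items first, then unsorted ones; irrelevant items are dropped.
        relevant = [r for r in self.candidates if self.verdicts.get(r) is True]
        unsorted_items = [r for r in self.candidates if r not in self.verdicts]
        return relevant + unsorted_items


# Made-up candidates pulled in for a new hybrid topic.
triage = TopicTriage([
    "paper on congestion tolls",
    "paper on car-following",
    "paper on simulated toll lanes",
])
triage.mark("paper on simulated toll lanes", True)
triage.mark("paper on car-following", False)
shown = triage.display_order()
```

Here `shown` puts the confirmed paper first, keeps the still-unsorted one after it, and omits the rejected one. Because verdicts are stored per topic, rejecting a resource here says nothing about its relevance on the parent topics.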