Tuesday, June 13, 2006

Broken Links

Day two of CHNM's Doing Digital History workshop kicked off with a topic close to my heart: data and text mining. One of the things that we discussed was link analysis (also known as graph mining or relational data analysis), the ability to exploit connections between entities as a way of making inferences or refining searches. Google makes a different, but related, use of links in its PageRank algorithm.

The discussion got me to thinking about broken links. We've all had the experience of clicking on a hypertext link and getting an HTTP 404 error, the message that the requested file cannot be found. It seems to be generally accepted that broken links are a bad thing. It is quite easy to write scripts that check each of the links on a web page and report on broken ones. If you don't want to write your own script, the W3C has an online link checker. In fact, a 2004 article in the BBC Technology News offered the hope that broken links might one day be eliminated altogether.
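As a rough illustration of how little code such a script needs, here is a minimal Python sketch using only the standard library. The names (LinkExtractor, http_status, find_broken_links) are my own, and the rule that any status of 400 or above, or no response at all, counts as "broken" is an assumption rather than a standard definition.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            # Skip mailto:, javascript: and other non-web schemes.
            if urlparse(absolute).scheme in ("http", "https"):
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute http(s) links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:   # the server answered with an error code
        return err.code
    except OSError:            # DNS failure, refused connection, timeout
        return None


def find_broken_links(html, base_url, status_of=http_status):
    """Return (url, status) pairs for links that look broken.

    Assumption: "broken" means no response or a status of 400 or above.
    """
    broken = []
    for link in extract_links(html, base_url):
        status = status_of(link)
        if status is None or status >= 400:
            broken.append((link, status))
    return broken
```

The status_of parameter is there so the checker can be tried out with a lookup table instead of live requests. Some servers reject HEAD requests, so a more careful version might fall back to GET before declaring a link dead.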

The article describes research done by student interns in conjunction with IBM. Their system follows working links to create a "fingerprint" of each page that is linked to. It can then determine when the content of the target page changes and notify system administrators or even change the link automatically. Such a system would reduce lost productivity and prevent large corporations from getting into the embarrassing position of linking to an innocuous site only to have it change into something disreputable.
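The article doesn't say how the fingerprints are computed, so the following is only a sketch of the general idea and not the interns' method. It assumes that a fingerprint is a SHA-256 hash of the page text with case and whitespace normalized, and the function names are hypothetical.

```python
import hashlib


def fingerprint(page_text):
    """Hash the page text after normalizing case and whitespace.

    Assumption: any change to the normalized text counts as a change
    to the page. A real system would probably want something more
    tolerant of small edits.
    """
    normalized = " ".join(page_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_targets(stored_fingerprints, current_pages):
    """Return the URLs whose current text no longer matches the stored fingerprint.

    stored_fingerprints maps URL -> fingerprint recorded earlier.
    current_pages maps URL -> page text fetched just now.
    URLs with no stored fingerprint are ignored.
    """
    return sorted(
        url
        for url, text in current_pages.items()
        if url in stored_fingerprints and stored_fingerprints[url] != fingerprint(text)
    )
```

An exact hash like this flags every edit, however trivial, so it shows only the bookkeeping: record a fingerprint while the link works, then compare on later visits.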

So far, so good. But what if we consider broken links to be a kind of historical evidence? Existing link checkers can already spider through whole sites looking for broken links. Rather than fixing them (or before fixing them), why not compile an archive of them to study? We could ask why links get broken in the first place. Sure, some are bound to be typos. But in many cases, the target site has moved to a different address, and this continual process of renaming reflects other processes: of rebranding, search engine positioning, system or business process reorganization, and so on. We should take a page from the archaeologists' book, and pay more attention to what our middens have to tell us.
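For what it's worth, such an archive needn't be elaborate. The sketch below assumes that some link checker has already produced (target URL, status) pairs for a page, and appends them to a CSV file along with the date and the page they were found on. The column layout and function names are my own invention.

```python
import csv
from collections import Counter
from urllib.parse import urlparse

# Assumed column order for the archive: one row per broken link per check.
FIELDS = ["checked_on", "source_page", "target_url", "status"]


def append_to_archive(archive_file, checked_on, source_page, broken):
    """Write one CSV row for each (target_url, status) pair in broken.

    archive_file is any writable text file object. A status of None
    (no response at all) is stored as an empty field.
    """
    writer = csv.writer(archive_file)
    for target_url, status in broken:
        writer.writerow(
            [checked_on, source_page, target_url, "" if status is None else status]
        )


def broken_by_host(rows):
    """Count archived broken links per target host."""
    return Counter(urlparse(target_url).netloc for _, _, target_url, _ in rows)
```

A tally by host is only a first question to put to the archive. Hosts that turn up again and again are the obvious places to start asking whether a site moved, was renamed, or vanished.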

