Saturday, January 28, 2006

Text Mining the DCB, Part 1

So far in our digital history hacks we have been working with the online Dictionary of Canadian Biography. The DCB has many properties which make it a good testbed for developing hacks. Unlike the American National Biography Online or the Oxford Dictionary of National Biography, the DCB is freely available. With about 10,000 entries, it is also small enough to be easily processed, yet large enough to make computational methods worthwhile.

Our previous hacks explored the categories to which the editors had already assigned many of the biographies. Our long-term goal, however, is to discover new information in online historical sources, both primary and secondary. Almost all of the existing works on digital history emphasize how new technologies and new media are changing the ways that we gather, preserve and present the past. (This is a paraphrase of the subtitle of Cohen & Rosenzweig's excellent Digital History; another example is David J. Staley's Computers, Visualization and History.) The explosion of online sources, however, also calls for a new historical methodology. Over the next few decades, finding a 'methodology for the infinite archive' will require at least as significant a reorientation in historical practice as did the work of von Ranke.

The crux of the problem is simple: every year we are creating an untold amount of digital information. In 2003, researchers at the School of Information Management and Systems at UC Berkeley estimated that the amount of new information that had been created the previous year was about 37,000 times larger than the book collection of the Library of Congress. Ninety-two percent of that information was stored on magnetic media, mostly hard disks. Needless to say, this has serious implications for the practice of history (see, for example, David Talbot's article, "The Fading Memory of the State").

Enter text mining, an emerging field that draws on techniques from machine learning, computational linguistics, information retrieval and other disciplines to discover new information in unstructured data. (For a recent introduction to text mining, see Weiss et al., Text Mining.)

Using text mining on the DCB is going to be much more involved than anything we have done before, so we will proceed in a series of steps. The first step is to create a local repository of the text to be mined. We go to the DCB website, generate a search results page listing the biographies of interest, and save it as an HTML file. I will choose Volume 1 of the DCB, which has biographies of 592 individuals who died between AD 1000 and 1700, and save the file as "dcbo-vol1.html". Next, we write a short hack to scrape the IDs and names from that file, and save the output as "dcbo-vol1-ids.txt". At this point we are almost ready to download the biographies.
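
For readers who want to see roughly what such a scraping hack might look like, here is a minimal sketch in Python. It is not the code used for this post. It assumes that each biography in the saved search page is a link whose address carries a numeric ID in a query parameter called "BioId"; that parameter name, and the helper names scrape_ids and save_ids, are illustrative guesses, so check the saved HTML and adjust the pattern before relying on it.

```python
import re
from html.parser import HTMLParser

# ASSUMPTION: each biography link carries a numeric ID in a query
# parameter named "BioId"; adjust the pattern to match the real markup.
ID_PATTERN = re.compile(r"BioId=(\d+)")


class BioLinkParser(HTMLParser):
    """Collect (id, name) pairs from <a> tags whose href matches ID_PATTERN."""

    def __init__(self):
        super().__init__()
        self.entries = []         # finished (id, name) pairs
        self._current_id = None   # ID of the link we are inside, if any
        self._text = []           # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = ID_PATTERN.search(href)
        if match:
            self._current_id = match.group(1)
            self._text = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_id is not None:
            # Join the fragments and collapse runs of whitespace.
            name = " ".join("".join(self._text).split())
            self.entries.append((self._current_id, name))
            self._current_id = None


def scrape_ids(html):
    """Return a list of (id, name) pairs found in an HTML string."""
    parser = BioLinkParser()
    parser.feed(html)
    parser.close()
    return parser.entries


def save_ids(entries, path):
    """Write one 'id<TAB>name' line per biography."""
    with open(path, "w", encoding="utf-8") as out:
        for bio_id, name in entries:
            out.write(f"{bio_id}\t{name}\n")
```

To use it, read "dcbo-vol1.html" into a string, pass that string to scrape_ids, and hand the result to save_ids along with the name "dcbo-vol1-ids.txt". A real HTML parser is used here instead of a bare regular expression over the whole page because names are sometimes wrapped in extra formatting tags inside the link.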

Before we do, however, we should first check the terms of use of the DCB site to make sure that we are not going to violate any of their policies. They say that the information can be reproduced for personal, noncommercial use "in part or in whole and by any means" without special permission. Good! We write another hack to download the 592 biographies from Volume 1 to our machine. (It is important when doing something like this to be a good citizen and not hammer their server, so be sure to code a small break between each download.)
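
A polite downloader along these lines could be sketched as follows, again in Python and again only as an illustration. The URL template is a placeholder, not the real DCB address, and the two-second pause is an arbitrary choice; the function names are likewise made up for this example. The sketch assumes an ID file with one tab-separated "id, name" pair per line.

```python
import time
import urllib.request
from pathlib import Path

# ASSUMPTION: placeholder URL template, not the real DCB address.
URL_TEMPLATE = "http://example.org/bio?id={bio_id}"
# ASSUMPTION: an arbitrary pause; choose one suited to the server.
DELAY_SECONDS = 2.0


def read_ids(path):
    """Read IDs from a tab-separated 'id<TAB>name' file, skipping blank lines."""
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                ids.append(line.split("\t")[0])
    return ids


def fetch(url):
    """Return the raw bytes served at url."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(ids, out_dir, fetch=fetch, delay=DELAY_SECONDS, sleep=time.sleep):
    """Save each biography as '<id>.html' in out_dir, pausing after each request."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for bio_id in ids:
        target = out / f"{bio_id}.html"
        if target.exists():
            continue  # already on disk; a rerun does not fetch it again
        target.write_bytes(fetch(URL_TEMPLATE.format(bio_id=bio_id)))
        saved.append(target)
        sleep(delay)  # be a good citizen: wait before the next request
    return saved
```

Skipping files that already exist means an interrupted run can be restarted without asking the server for the same pages twice. The fetch and sleep arguments are there only so the loop can be tried out without touching the network.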

Next, we will have to strip out all of the HTML formatting for each biography...

(26 Sep 2008: Links to code updated)
