Sunday, February 19, 2006

Text Mining the DCB, Part 5

My original plan for this series of posts was to show how it is relatively simple to spider an online historical collection like the Dictionary of Canadian Biography, scrape out some information, and use it to make possibly novel inferences. Up until now, we've been storing information in text files, which is OK for a simple demo but unwieldy if we want to try something more ambitious.

As a result, I've decided to make a few modifications so that we can store information in a relational database instead. Eventually, I will probably move to something open source on a server, like MySQL. In the meantime, however, I already have MS Access installed on my computer, so that is what I am going to use. It shouldn't be too hard to port later.

In my next post I will describe the initial data tables and the way we get Perl to talk to our database...


Friday, February 10, 2006

Historiographical Process(ing)

In my first post, I posed the problem of developing methodologies for an archive that is constantly changing and effectively infinite. Obviously this has implications for the way we think of traditional activities like creating bibliographies and writing historiography. Consider the way that dissertations are usually written: the student does a literature review, writes a historiographical introduction and dissertation proposal, then does archival work and writes the rest of the monograph. Research handbooks suggest that one should check for new literature when the monograph is nearing completion, so that it can be as up-to-date as possible. Given the hell that is the academic job search, this may or may not happen.

One problem with the traditional model is that most academics don't seem to realize that the world of scholarship has completely changed within the last seven years. Even the most newly minted of PhDs began his or her dissertation in the aftermath of the dot.bomb, when it wasn't clear which companies would survive, or what the web of the future would be like. Google was coming to prominence as the interface to the web, and new sources for just about any research topic could be turned up regularly. One measure of this historical shift comes from the data mining research group at the Online Computer Library Center (OCLC), the folks who bring us WorldCat, among other things. In a presentation from last year, they show that the number of records for digital materials entered into WorldCat was below about 20,000 per year from the mid-1980s through 1998. In 1999, it jumped to about 30,000. In 2000, it jumped again, this time to over 160,000. In every year since then, more than 100,000 records have been entered for digital materials. And that is just the material showing up in WorldCat; it does not count Google's relatively new project to make 30 million books full-text searchable.

It's time we rethink bibliography and historiography as processes, or better yet, as processing, as something that our bots can be continually working on in the background.

One vision for this comes from the unsettling world of contemporary data mining. In No Place to Hide, Robert O'Harrow quotes Jeff Jonas, chief scientist at Systems Research & Development:

Our work is about perpetual analytics, instant intelligence, as fast as something is introduced, instantaneously being able to tell if that means something important to you. You're sitting under an ocean of data, and every day millions of gallons are being added, and every day you have to go through zillions of drops to find out whether there's something important in there. You're slicing time down to the nanosecond, so you can see every drop hit. So when each drop hits you can see where it lands, what it's next to. You can measure the ripples, and there is an instant where you can make interesting decisions about what has changed.


Thursday, February 09, 2006

Text Mining the DCB, Part 4

In our efforts to do some simple text mining of the online Dictionary of Canadian Biography, we created a list of 65 distinctive features and located each of the biographies in Volume 1 in a 65-dimensional space. The next step is to figure out how far (in some sense) each of the biographies is from its neighbours.

Remember that we coded the absence of a given feature as '0' and the presence as '1' for each biography. This means that we can compare any two biographies simply by lining up the corresponding values for each feature, like so (assuming, for the sake of example, that we only have six features instead of 65):

Feature                 1  2  3  4  5  6
Argall, Sir Samuel      1  1  1  0  0  0
Champlain, Samuel de    1  0  1  1  0  1

Now we have a number of options for computing the distance between any two biographies. One possibility would be to count the features on which the two agree: those that both have plus those that both lack. For the example above, the two biographies share the first and third features and both lack the fifth. A simple distance measure is

distance = 1 - (number of features in agreement / total number of features)

Since these two biographies agree on three of the six features (two that both have and one that both lack), their distance by this measure is

1-(3/6) = 0.5

If two biographies had all of the same features, then their distance would be

1-(6/6) = 0

which makes sense, and if they agreed on none of the features, it would be

1-(0/6) = 1

This measure is applied to all of our biographies with this hack. We can then write another hack to sift through the file looking for biographies that are very close together or very far apart.
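As a minimal sketch of the measure described above (not the hack itself, which is not reproduced here), the following Python computes the distance for the six-feature example and then sorts every pair from closest to farthest. The function name matching_distance and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

def matching_distance(a, b):
    """Return 1 - (features in agreement / total features) for two 0/1 strings."""
    if len(a) != len(b):
        raise ValueError("feature codes must have the same length")
    agreements = sum(1 for x, y in zip(a, b) if x == y)
    return 1 - agreements / len(a)

# The six-feature example from the text.
features = {
    "Argall, Sir Samuel": "111000",
    "Champlain, Samuel de": "101101",
}

# Distance for every pair of biographies, sorted from closest to farthest.
pairs = sorted(
    (matching_distance(features[p], features[q]), p, q)
    for p, q in combinations(sorted(features), 2)
)

for distance, p, q in pairs:
    print(f"{distance:.2f}  {p} / {q}")
```

With more biographies in the dictionary, the first entries of the sorted list are the pairs that are very close together and the last entries are the pairs that are very far apart.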

Before looking at the results, it is worth trying to predict what we might find. Since our distance measure includes features that two biographies lack, one way for a pair of biographies to be close is for both to lack a lot of features. This might especially be the case for very short entries. (Another possibility, and the preferred one, would be for two biographies to match because they shared a lot of the same features. If this turns out to be the behaviour that we want, we can easily adjust our distance measure to only count shared features, and not the shared lack of features.) How might a pair of biographies be far apart? They may disagree on a lot of features, and this would be the preferred result. But it is also possible that one biography might have a lot of features because it is relatively long, and would thus differ from short ones. Again, this may or may not be interesting if it happens, and we may have to frob our distance measure.

So what happens when we run the hacks? The biographies that are closest together by this measure are ones like François Bailly, Mathurin Gagnon, Marie-Françoise Giffard, Jean Guyon du Buisson, and Marie Irwin. Each is very short, confirming our suspicion that we might not want to pay so much attention to a lack of features. (As an aside, however, note that this measure pulls out the men and women who are more obscure, and thus could be useful for finding historical topics in the 'long tail'.)

What about the pairs of biographies that are far apart? Here we find something quite unexpected. There are two biographies that are very distant from a bunch of the others, Samuel de Champlain and Jean Talon. These two men dominate most accounts of New France. Their entries in the DCB are very long. So not only does minimizing this distance measure give us a way of approaching Canadian historical biography from below, but maximizing it allows us to sift out the 'great men'.

At this point, a skeptic might say that all we have done is create a nonobvious way to estimate the length of biographical entries... the longer the entry, the more likely it is to have positive features. We can easily test this hypothesis by checking the file sizes of all of the biographies. Sure enough, Champlain's and Talon's are the longest in Volume 1. So we will want to refine our distance measure to exclude the shared lack of features when we move to the next step of our investigation, clustering.
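One way to make that refinement, sketched here under assumptions rather than taken from the actual hacks, is to count only the features that both biographies have. The post does not say what the count should be divided by; this sketch divides by the number of features that at least one of the two biographies has, which is the standard Jaccard distance, a swapped-in measure. The function name and the handling of two feature-less entries are hypothetical choices.

```python
def jaccard_distance(a, b):
    """Return 1 - (features both have / features at least one has) for two 0/1 strings."""
    if len(a) != len(b):
        raise ValueError("feature codes must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    if either == 0:
        # Assumption: two entries with no features at all share nothing,
        # so they are treated as far apart rather than identical.
        return 1.0
    return 1 - both / either

# Argall and Champlain from the six-feature example: 2 shared out of 5 present.
print(round(jaccard_distance("111000", "101101"), 2))

# Two short entries that lack every feature are no longer counted as close.
print(jaccard_distance("000000", "000000"))
```

Under this measure the very short entries stop clustering together simply because they are short, which is the behaviour the refinement is meant to remove.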


Tuesday, February 07, 2006

Doing Digital History

I can think of a lot of reasons to be interested in doing digital history, so I am always curious to find out why other historians are not. In talking to a variety of people, I've heard the following objections, in no particular order.

The sources that I need for my project are not online. I understand this one. If you are in the middle of writing a book about the Hudson's Bay Company's use of ships, you are not going to find many of your key primary sources online. You may be able to find a lot of contextualizing material, however. Cross-referencing the ships' area of service (from the HBC archives in Winnipeg) with the visual records search from the BC Archives in Victoria immediately turns up an illustration of the HBC ships Prince Albert and Prince Rupert (B-00261), and models of the schooner Cadborough (B-00499) and barque Columbia (B-00500), which can be returned as images in a handy contact sheet. Of course, these aren't the only two Canadian archives with information online. The page of Canadian Archival Resources on the Internet lists 82 archival websites for British Columbia alone. One of these links takes you to the British Columbia Archival Information Network, which has a page of online databases of historical photographs. Not only are there more sources online than you think, but with a bit of programming it becomes much easier to automate the process of finding and collating them.

We tried this in the 60s and 70s and it didn't work. I hear this one from more senior colleagues. Remember quantitative history? People spent a lot of time creating databases and linking records and now all of the information is inaccessible on old magnetic tapes. The difference, as my friend Marcel Fortin likes to point out, is that back then neither the World Wide Web nor open source had been widely adopted. When a resource such as the Dictionary of Canadian Biography is made available online, it can become a platform for further innovation.

Won't someone else write a program which I can use? Yes and no. We all use word-processors, spreadsheets, e-mail, search engines, library catalogues, and so on. No one wants to re-invent those wheels. Digital history has become very promising because we have access to those tools and so many others: high-level scripting languages that make web programming easy (like Perl or Python), archives of powerful modules that can help you do almost anything you can think of (like CPAN for Perl), and application program interfaces (APIs) that allow programmers to build their own applications on top of those provided by Google, Yahoo!, Amazon, and thousands of other companies and institutions. (See Dan Cohen's excellent article for more on APIs in the digital humanities.) Sure, a few people are writing useful tools for historians. But if you want something tailored to your own research, and you need it now, you're going to have to roll your own. That means doing digital history.

And, as Stephen Colbert would say, "that's the wørd."


Saturday, February 04, 2006

Text Mining the DCB, Part 3

Last week we began the process of text mining the online Dictionary of Canadian Biography. We created local HTML and text copies of each of the 592 biographical entries in Volume 1. We also passed the text copies through a commercial concordancing program to determine which words occurred most frequently. Once we threw out the words that were too common to provide much useful information ('the', 'a', 'and', etc.), we were left with a set of words which will potentially convey interesting and distinctive information about the biographies in Volume 1. A list of these words is in the following text file.
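The counting step was done with a commercial concordancing program, but the idea can be sketched in a few lines of Python. This is a stand-in under assumptions, not the program that was used: the stop list is a tiny illustrative one, the sample sentence is made up, and the function name is hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "he", "his"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lower-cased words in text, skipping the stop words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# A made-up sentence standing in for the text of a biography.
sample = "He sailed to Quebec and the Jesuits met him in Quebec."

for word, count in word_frequencies(sample).most_common(3):
    print(word, count)
```

Summing the counters for all of the text files, and then reading down the most common entries, gives a candidate list of distinctive words to choose from.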

We can think of each of these words as a dimension in a highly multidimensional space. This is not immediately intuitive, so it helps to work up to the idea with simpler examples. Suppose we have two people, Samuel de Champlain and Sir Samuel Argall. In Champlain's biography, it mentions Montreal, but in Argall's biography it does not. If every biography that mentions Montreal is coded with a '1' and every biography that doesn't is coded with a '0', then we have a one-dimensional space. Every biography in this space will either be located at '0' or '1'. In fact, 196 of the 592 biographies in Volume 1 do mention Montreal, and the other 396 do not. In some sense, we want to say that those biographies that do mention Montreal are closer to one another (at least along that dimension) than the ones that do not.

But what if we take another word into account, like "Jesuit(s)"? We now have two dimensions and four possibilities: a biography might mention both Montreal and Jesuit(s), it might mention one but not the other, or it might mention neither. In fact, Jesuits are mentioned in both Champlain's and Argall's biographies. Neither Montreal nor Jesuits are mentioned in the biography of John Abraham, however.

In some sense, again, we would like to say that the biography of Champlain is a bit closer to that of Argall (from which it differs along one dimension) than it is to that of Abraham (from which it differs along two dimensions). We would also like to say that Argall's biography is the same distance from each of the other two. This is shown in the following table.

                        Jesuit(s)  Montreal
Abraham, John           0          0
Argall, Sir Samuel      1          0
Champlain, Samuel de    1          1

We can go on adding words in this fashion, each becoming a dimension along which biographies can differ. (It becomes hard to visualize these dimensions once you have more than three, of course.)
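The comparisons described above amount to counting the dimensions along which two codes differ, which is usually called the Hamming distance. The sketch below is only an illustration; the function name is hypothetical, and the order of the two digits (Jesuit(s) first, Montreal second) is inferred from the surrounding text.

```python
def differing_dimensions(a, b):
    """Count the dimensions along which two equal-length 0/1 codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

# First digit: mentions Jesuit(s); second digit: mentions Montreal (inferred order).
abraham = "00"
argall = "10"
champlain = "11"

print(differing_dimensions(champlain, argall))   # Champlain vs. Argall
print(differing_dimensions(champlain, abraham))  # Champlain vs. Abraham
print(differing_dimensions(argall, abraham))     # Argall vs. Abraham
```

The same function works unchanged when the codes grow from two digits to 65, which is what makes the multidimensional picture practical even though it cannot be drawn.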

Given the biographies that we already downloaded and the 65 words in our text file, we can write a short hack to create a 65-dimensional feature space and locate each biography within it. We output the results as a spreadsheet.
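A rough sketch of such a step in Python follows. It is written under assumptions and is not the hack itself: the two feature words, the made-up biography texts, and the function name locate are placeholders, and the output goes to an in-memory buffer rather than a real spreadsheet file.

```python
import csv
import io
import re

def locate(text, feature_words):
    """Return a list of 0/1 values, one per feature word, for one biography."""
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    return [1 if w.lower() in words else 0 for w in feature_words]

# Hypothetical stand-ins for the 65-word list and the downloaded text files.
feature_words = ["montreal", "jesuits"]
biographies = {
    "Abraham, John": "A made-up sentence mentioning neither word.",
    "Champlain, Samuel de": "A made-up sentence mentioning Montreal and the Jesuits.",
}

# Swap io.StringIO() for open("features.csv", "w", newline="") to write a file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["biography"] + feature_words)
for name, text in sorted(biographies.items()):
    writer.writerow([name] + locate(text, feature_words))

print(out.getvalue())
```

This version only matches whole words exactly, so variants such as "Jesuit" and "Jesuits" would need to be listed separately or folded together before matching.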

Next we will want to formalize the idea of distances in our feature space...
