Sunday, January 29, 2006

Text Mining the DCB, Part 2

Yesterday we began the process of text mining the online Dictionary of Canadian Biography by downloading all of the biographies in Volume 1 (people who died between AD 1000 and 1700). We saved these on a local machine as HTML files, but for now, we will want to process raw text files. We write a simple hack to strip out all of the HTML tags and do a little bit of additional cleanup.
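A minimal sketch of what such a tag-stripping hack might look like, written in Python with only the standard library. This is not the original hack: the class and function names are made up for illustration, and the "additional cleanup" is assumed to mean collapsing whitespace and dropping script and style blocks.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(html):
    """Return the text of an HTML document with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so adjacent block elements do not run together,
    # then collapse runs of whitespace (the assumed "additional cleanup").
    text = " ".join(parser.chunks)
    return re.sub(r"\s+", " ", text).strip()
```

Joining the chunks with a space keeps headings and paragraphs apart, at the cost of occasionally splitting a word that has inline markup in the middle of it.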

At this point it is probably a good idea to sketch out where we are going. The basic idea in text mining is to represent the features of a collection of documents in such a way as to facilitate further processing. Each of our documents is a biography of some person in the DCB. Each feature is something that the biography may or may not share with other biographies, like the presence of a particular word.

We can imagine this in the form of a simple spreadsheet, with a '1' if a particular word appears in the biography and a '0' if not:

Name                  Quebec  HBC  Acadia
Abraham, John         1       1    0
Aernoutsz, Jurriaen   0       0    1
Agariata              1       0    0


Many machine learning algorithms have been developed to work with this kind of representation, and it will be fairly straightforward to generate from our biographies.
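To make the representation concrete, here is a rough sketch of how the spreadsheet above could be generated. The function names are hypothetical, and the three "biographies" are invented placeholder sentences, not text from the DCB; they are chosen only so that the output reproduces the table.

```python
import re


def tokenize(text):
    """Return the set of lowercased words in a text."""
    return set(re.findall(r"\w+", text.lower()))


def binary_matrix(documents, features):
    """Map each document name to a row of 1s and 0s, one entry per feature."""
    rows = {}
    for name, text in documents.items():
        words = tokenize(text)
        rows[name] = [1 if feature.lower() in words else 0 for feature in features]
    return rows


# Placeholder texts (invented for illustration, not real DCB biographies).
toy_biographies = {
    "Abraham, John": "Placeholder text mentioning Quebec and the HBC.",
    "Aernoutsz, Jurriaen": "Placeholder text mentioning Acadia.",
    "Agariata": "Placeholder text mentioning Quebec.",
}
features = ["Quebec", "HBC", "Acadia"]
matrix = binary_matrix(toy_biographies, features)
```

Each row of `matrix` is a list in the same order as `features`, so it can be written straight out to a spreadsheet or handed to a learning algorithm.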

At this point, however, it is not clear where the features come from. We could try to use every word that appears in at least one biography, but that would not be very effective. For one thing, there are 592 biographies in volume 1, so the total number of distinct words will be quite high. For another, some words (like 'the', 'a', 'and') will be so common that they will not provide much useful information.

In the future, we will explore methods to generate the list of features automatically. For now, we are going to use a commercial program to generate a concordance. A concordance tells us how frequently every single word occurs in the collection of biographies and lets us explore the contexts where it does.
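The commercial program does far more than this, but the two things a concordance gives us (word frequencies and contexts) can be sketched in a few lines of Python. The function names and the simple `\w+` tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def tokens(text):
    """Split a text into lowercased word tokens (dates count as words too)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(texts):
    """Count how often every word occurs across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokens(text))
    return counts


def keyword_in_context(text, keyword, width=3):
    """Return (left, keyword, right) triples for each occurrence of keyword."""
    words = tokens(text)
    target = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        if word == target:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits
```

Calling `word_frequencies(texts).most_common(20)` on the stripped biographies would list the twenty most frequent words, which is also a quick way to spot uninformative ones like 'the' and 'and'.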

Some of the information that we can extract from the concordance will be quite interesting. For example, since each date is treated as a word by the software, we can figure out which dates occur in the biographies more frequently than others. Before we do this calculation, we can make a few predictions. Since historians like round numbers, we expect that decades will tend to be more popular than surrounding years. If you know that something happened in the late 1630s or early 1640s, you are more likely to write "around 1640" than "around 1639." The second prediction that we can make is that later dates will be more common than earlier ones. The closer you get to the present, the more people there are and the more we know about them on average. Finally, since the DCB is arranged by death dates, and since we haven't included any biographies from Volume 2, we expect the dates to taper off near AD 1700 (as people in Volume 1 die). A smooth curve plotted through the number of times each date occurs in Volume 1 looks like the following figure.



The curve looks pretty much like we expected it to. There are peaks at 1600, 1610, 1620 and most of the other decades in the seventeenth century. If we set those aside, we are left with a series of dates which may or may not be significant in early Canadian history: 1498, 1578, 1583, and so on. The most prominent seems to be 1666. In a future hack, we will try to automatically extract dates and match them against timelines from another source. For now it is sufficient to note that many of the peak dates coincide with well-known events in Early Canada. 1498 was the year of Cabot's second voyage, for example, and 1663 the year that the French crown took control of New France.
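For readers who want to try the date count themselves, the following is a sketch under a few assumptions: a "date" is any standalone number from 1000 to 1700, and the smoothing is a plain centered moving average. The post does not say how the curve in the figure was actually smoothed, so the second function is only one plausible choice.

```python
import re
from collections import Counter

# Standalone four-digit years from 1000 through 1700 (the span of Volume 1).
YEAR = re.compile(r"\b(1[0-6]\d\d|1700)\b")


def year_counts(texts):
    """Count how often each year in the range occurs across the texts."""
    counts = Counter()
    for text in texts:
        counts.update(int(year) for year in YEAR.findall(text))
    return counts


def moving_average(counts, start, end, window=5):
    """Smooth yearly counts with a centered moving average (window should be odd)."""
    half = window // 2
    smoothed = {}
    for year in range(start, end + 1):
        span = range(year - half, year + half + 1)
        smoothed[year] = sum(counts.get(y, 0) for y in span) / window
    return smoothed
```

Comparing `counts[1640]` with `counts[1639]` and `counts[1641]` is a direct check on the round-number prediction; note that a wide smoothing window will flatten exactly those decade peaks.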

Next we will use the concordance to establish a list of features so we can create a spreadsheet representation of Volume 1 that looks like the one above...

Saturday, January 28, 2006

Text Mining the DCB, Part 1

So far in our digital history hacks we have been working with the online Dictionary of Canadian Biography. The DCB has many properties which make it a good testbed for developing hacks. Unlike the American National Biography Online or the Oxford Dictionary of National Biography, the DCB is freely available. With about 10,000 entries, it is also small enough to be easily processed, yet large enough to make computational methods worthwhile.

Our previous hacks explored the categories to which the editors had already assigned many of the biographies. Our long-term goal, however, is to discover new information in online historical sources, both primary and secondary. Almost all of the existing works on digital history emphasize how new technologies and new media are changing the ways that we gather, preserve and present the past. (This is a paraphrase of the subtitle of Cohen & Rosenzweig's excellent Digital History; another example is David J. Staley's Computers, Visualization and History.) This explosion of online sources also calls out for a new historical methodology, however. Over the next few decades, finding a 'methodology for the infinite archive' will require at least as significant a reorientation in historical practice as did the work of von Ranke.

The crux of the problem is simple: every year we are creating an untold amount of digital information. In 2003, researchers at the School of Information Management and Systems at UC Berkeley estimated that the amount of new information that had been created the previous year was about 37,000 times larger than the book collection of the Library of Congress. Ninety-two percent of that information was stored on magnetic media, mostly hard disks. Needless to say, this has serious implications for the practice of history (see, for example, David Talbot's article, "The Fading Memory of the State").

Enter text mining, an emerging field that draws on techniques from machine learning, computational linguistics, information retrieval and other disciplines to discover new information in unstructured data. (For a recent introduction to text mining, see Weiss et al., Text Mining.)

Using text mining on the DCB is going to be much more involved than anything we have done before, so we will proceed via a series of steps. The first thing that we want to do is create a local repository of the text to be mined. We go to the DCB website, create a search page of biographies of interest, and save the HTML file. I will choose Volume 1 of the DCB, which has biographies of 592 individuals who died between AD 1000 and 1700, and save the file as "dcbo-vol1.html". Next, we write a short hack to scrape the IDs and names from that file, and save the new file as "dcbo-vol1-ids.txt". At this point we are almost ready to download the biographies.
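The scraping step might look something like the sketch below. It is not the original hack, and it rests on a loud assumption: that each biography in the saved search page is a link whose URL carries the ID in a query parameter. The parameter name `BioId` and the sample markup in the usage note are hypothetical, so check the real page source before relying on either.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class BioLinkParser(HTMLParser):
    """Collects (id, name) pairs from links that carry an ID query parameter."""

    def __init__(self, id_param):
        super().__init__()
        self.id_param = id_param
        self.entries = []
        self._current_id = None
        self._name_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        values = parse_qs(urlparse(href).query).get(self.id_param)
        if values:
            self._current_id = values[0]
            self._name_parts = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._name_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_id is not None:
            name = " ".join("".join(self._name_parts).split())
            self.entries.append((self._current_id, name))
            self._current_id = None


def scrape_ids(html, id_param="BioId"):
    """Return a list of (id, name) pairs; id_param is a hypothetical name."""
    parser = BioLinkParser(id_param)
    parser.feed(html)
    parser.close()
    return parser.entries
```

Given made-up markup such as `<a href="bio?BioId=101">ABRAHAM, JOHN</a>`, `scrape_ids` returns `[("101", "ABRAHAM, JOHN")]`; writing one tab-separated pair per line would then give a file like "dcbo-vol1-ids.txt".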

Before we do, however, we should first check the terms of use of the DCB site to make sure that we are not going to violate any of their policies. They say that the information can be reproduced for personal, noncommercial use "in part or in whole and by any means" without special permission. Good! We write another hack to download the 592 biographies from Volume 1 to our machine. (It is important when doing something like this to be a good citizen and not hammer their server, so be sure to code a small break between downloads.)
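Here is one way the "small break" could be coded, as a sketch rather than the original hack. The URL template in the usage note is a placeholder (example.org), not the DCB's real address, and the two-second default pause is an arbitrary choice.

```python
import time
from urllib.request import urlopen


def fetch_url(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def download_all(ids, url_template, fetch=fetch_url, pause_seconds=2.0, sleep=time.sleep):
    """Fetch one page per ID, pausing between requests to spare the server.

    url_template must contain "{id}". The fetch and sleep arguments can be
    swapped out, which makes the loop easy to try without touching the network.
    """
    pages = {}
    for i, bio_id in enumerate(ids):
        if i > 0:
            sleep(pause_seconds)  # be a good citizen: wait before the next request
        pages[bio_id] = fetch(url_template.format(id=bio_id))
    return pages
```

A call might look like `download_all(ids, "https://example.org/bio?id={id}")`, with the placeholder replaced by whatever URL pattern the site actually uses. At two seconds per request, 592 biographies take about twenty minutes, which is a small price for not hammering the server.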

Next, we will have to strip out all of the HTML formatting for each biography...

(26 Sep 2008: Links to code updated)

Sunday, January 22, 2006

The Adams Effect

[A newer edition of this post is available here]

In an earlier hack we discovered that the majority of people in the Dictionary of Canadian Biography were categorized as businessmen, office holders, politicians, lawyers and soldiers. The biographies cover a 930-year span, however, which raises the question of which occupations were prevalent at any particular time. One might suspect, for example, that there were relatively fewer explorers and fur traders in more recent years. We can call this the "Adams Effect," after a passage from a letter that John Adams wrote to Abigail Adams in 1780. (Not to be confused with Pepper Adams's final album of the same name.) Anyway, John Adams said

I must study politics and war that my sons may have liberty to study mathematics and philosophy, geography, natural history, naval architecture, navigation, commerce, and agriculture, in order to give their children a right to study painting, poetry, music, architecture, statuary, tapestry, and porcelain.


Does this capture the Canadian experience? Time for another hack. To save time, we scrape the search page and extract the codes for the different volumes and categories. Biographies in four of the twelve volumes haven't been categorized, so there will be some gaps in our data, but we should have enough to see trends. We build a table of data to analyze in a spreadsheet. We can then plot the relative proportion of each occupation over time. (Relative because there are more biographies overall in some volumes.) Also, since some occupations are very common (e.g., member of armed forces) and others are not (e.g., architect) we will want to plot the numbers on a log scale. The following three graphs show which occupations lose, maintain or gain ground over time.







So, is there an "Adams Effect" in Canadian biography? His "war" prediction is spot on: almost 21% of the biographies in Volumes 2 (1701-40) and 4 (1771-1800) are for members of the armed forces. This drops steadily to about 4 or 5% by the late nineteenth century. "Politics," alas, becomes more, rather than less, popular, accounting for about 10% of the biographies in the first half of the nineteenth century, and peaking at about 14% in Volume 11 (1881-90). It is worth emphasizing that the volumes of the DCB are organized by death dates, and that politicians who died in the 1880s may well have been active in the 1860s and 70s (i.e., during Canadian Confederation and its aftermath).

How about the next generation? We see a decline in explorers and mariners, in the fur trade and agriculture. Surveying and engineering hold relatively steady. In the final generation we do see an increase in architects, authors and educators. Some predictions (like "statuary," "tapestry" and "porcelain") are harder to check. Overall, not too bad for a statement which was never intended as a prediction.
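The arithmetic behind the graphs is simple enough to sketch. We did this step in a spreadsheet; the Python below is only an illustration of the same two operations (dividing by the volume total, then taking logs), with hypothetical function names. Any counts you feed it for testing are your own, since the scraped table is not reproduced in this post.

```python
import math


def relative_proportions(counts_by_volume, totals_by_volume):
    """Return {occupation: {volume: share of that volume's biographies}}."""
    shares = {}
    for volume, counts in counts_by_volume.items():
        total = totals_by_volume[volume]
        for occupation, count in counts.items():
            shares.setdefault(occupation, {})[volume] = count / total
    return shares


def log10_shares(shares):
    """Log-transform the shares for plotting.

    Zero shares are skipped, since the logarithm of zero is undefined.
    """
    return {
        occupation: {v: math.log10(s) for v, s in by_volume.items() if s > 0}
        for occupation, by_volume in shares.items()
    }
```

On a log scale an occupation at 21% and one at 1% sit about 1.3 units apart instead of twenty percentage points apart, which is what lets common and rare occupations share one graph.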

(26 Sep 2008: links to code updated)

Monday, January 16, 2006

Historical Topics in the "Long Tail"

[A newer edition of this post is available here]

Yesterday I posted a hack which allows you to quickly visualize what kinds of people have entries in the online Dictionary of Canadian Biography. I noted that the preponderance are male. They are mostly businessmen, office holders, politicians, lawyers and soldiers. This comes as no surprise to me. I teach Canadian history and often have the following conversation:

"I'm surprised that I'm really enjoying your course!"
Me: "Why?"
"Because I thought Canadian history was boring."
Me (disingenuously): "Why on earth would you think that?"


But I know why they think that. Until they are exposed to the kinds of questions that practicing historians struggle with, they don't know how exciting the subject can be. (For more on the perception of Canadian history as boring, see Allan Greer's excellent piece from the Ottawa Citizen).

The image that I posted yesterday highlights the typical biographies in the DCB, but the more you study it, the more you begin to see what isn't typical. There are, for example, three slaves categorized as such in the DCB: an Inuk girl named Acoutsina, a young black woman named Marie-Joseph-Angélique, and a black man named Jack York. Now, although there were some slaves in Canadian history, they have not been a very prominent part of the national narrative until relatively recently. Marcel Trudel's 1960 L'esclavage au Canada français was an early exception to the general trend; now it has been joined by works like Denyse Beaugrand-Champagne's Le procès de Marie-Josèphe-Angélique (2004) and Brett Rushforth's forthcoming Savage Bonds: Indigenous and Atlantic Slaveries in New France. The major survey texts, Conrad and Finkel's History of the Canadian Peoples, Francis, Jones and Smith's Origins, and Bumsted's Peoples of Canada now all have sidebars about slaves.

Again, this trend toward social and cultural history will come as no surprise to anyone who has been involved in the profession or taken a university-level history course recently. I see the interest in slaves, however, as an example of what we might call a "long tail" historical topic (to be Web 2.0 buzzword-compliant). That is to say that most of the biographies in the DCB are of businessmen (almost 25%), office holders (about 21%), politicians (18.5%), members of the legal profession (16%) and members of the armed services (almost 18%). On the other hand, people explicitly categorized as slaves account for only 0.0375% of the entries. (The reason I've been careful to say "categorized as slaves" is that a full-text search of the DCB brings up entries for a number of slaves who were not categorized as such: Joe, Marguerite Duplessis, Marie, Pierre, and others.)

In an earlier post, I suggested that one of the hallmarks of digital history will be that it will increasingly give us access to these long tail topics, if only we can find them. This brings us, however, to the historian's favorite question, "So what?" What is the significance of any individual, any event? In her 1996 presidential address to the AHA, Caroline Bynum said, "For surely what characterizes historians above all else is the capacity to be shocked by the singularity of events in a way that stimulates the search for 'significance'..." Having access to all of these singularities is going to force us to face questions that microhistorians have been facing for years, questions about whether we are dealing with ordinariness or extraordinariness, normality or abnormality, the rule or the exception.

Sunday, January 15, 2006

Who is in the Dictionary of Canadian Biography?

[A newer edition of this post is available here]

When I was writing my dissertation in 2003, Library and Archives Canada put the Dictionary of Canadian Biography online, and made it freely available to scholars. Since I was in the US at the time, this made my life much easier. Instead of taking the subway to a library that had a copy of the DCB every time I needed to look someone up, I could now get biographical information without interrupting my writing.

The online DCB has many other advantages over the print edition, however. For one thing, the entire text can be searched for keywords. If you are interested in a relatively obscure place that may no longer exist, you can immediately find the biographies that mention that place. If you search for "Fort Chilcotin," for example, you will only find one match, "Klatsassin." Most keywords that appear very infrequently will not make it into a printed index, making them almost impossible to find without full-text searching.

Another advantage of online information is that it can often be made even more useful with a little bit of web programming. (For more on this idea see "Teaching Young Historians to Search, Spider and Scrape.") Thus the first of our digital history hacks.

On the advanced search page of the online DCB, it is possible to click on a volume number, geographical region, gender, or "identification" to see how many biographies match that category. Doing this shows that there are, for example, 450 biographies of females and 7,548 biographies of males. It is also possible to combine categories. There are 15 biographies of female aboriginal people and 229 biographies of male aboriginal people. Exploring the search page in such a desultory fashion can tell you a lot about Canadian historiography. Wouldn't it be nice to be able to automate this exploratory process?

This hack scrapes the search page to extract the codes for each of the identification categories, then 'clicks' each category and grabs the number of matching biographies. The results are then presented as a "tag cloud," a representation where the font size is proportional to the number of hits. The code for the hack was written in Perl and is available here. The tag cloud of entries in the DCB looks like this:



Now what do we see? The vast majority of people in the DCB are businessmen, office holders, politicians, lawyers and soldiers. This, too, says a lot about Canadian historiography. It also suggests a new question: how do the categories change over time? That's another hack for another day.
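The tag cloud itself needs very little code once the counts are in hand. The original hack was written in Perl; what follows is a Python sketch of only the last step, with a hypothetical function name and pixel sizes picked arbitrarily. Font size is proportional to the count, except that a minimum size is enforced so that rare categories stay legible.

```python
from html import escape


def tag_cloud(counts, min_px=10, max_px=48):
    """Render {label: count} as HTML spans sized in proportion to the counts."""
    if not counts:
        return ""
    largest = max(counts.values())
    spans = []
    for label in sorted(counts):
        # Proportional scaling, with a floor so small categories remain readable.
        size = max(min_px, round(max_px * counts[label] / largest))
        spans.append(f'<span style="font-size:{size}px">{escape(label)}</span>')
    return " ".join(spans)
```

The floor matters for long tail categories: with the gender counts quoted above (7,548 male, 450 female), strict proportionality would shrink "female" to about 3 pixels next to a 48-pixel "male".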

(26 Sep 2008: link to code was updated)

Saturday, January 07, 2006

Findable Public History

[A newer edition of this post is available here]

I am currently reading Peter Morville's new book Ambient Findability, which is concerned with how people find their way through information and through the world. Many of the ideas that he explores have direct relevance for public history. For example, online marketing is rapidly shifting the balance of power from advertisers to consumers, from 'push' to 'pull'. One implication of this (and Morville's main point) is that 'findability' becomes crucial. Consumers use tools like search engines to find information and evaluate it; if your product is too far down the results page it may as well not exist. Another implication is that there is an effect that Chris Anderson calls the Long Tail. There is a sizable market for things that are only available online. As an example, Anderson and his colleagues estimate that a quarter to a third of Amazon's book sales are for books that are not on their top-100,000 bestsellers list. If the average Barnes & Noble or Chapters carries about 100,000 books, then these are the books that you won't find in stores.

As an academic historian, I can tell you that these include almost all of the books written by my colleagues, and will no doubt include my own book when I get it published. Should we throw up our hands? Not necessarily. In Digital History, Cohen and Rosenzweig argue that

digitization can dramatically increase the use of previously neglected collections by making inaccessible materials easily discoverable. The Making of America collection largely draws from books from the University of Michigan’s remote storage facility that had rarely been borrowed in more than thirty years. Yet researchers now access the same 'obscure' books 40,000 times a month.


What does this mean for public historians? On the one hand, it means that there is probably a significant online audience for any of our work. On the other, it means that we have to be concerned with findability. For every work we produce we should be asking ourselves, "how are people going to get here?"
