Sunday, October 29, 2006

The Spectrum from Mining to Markup

In a series of earlier posts I've shown that simple text and data mining techniques can be used to extract information from a sample historical source, the online Dictionary of Canadian Biography. With such techniques it is possible to cluster related biographies, to try to determine overall themes or to extract items of information like names, dates and places. Information extraction can be particularly tricky because natural languages are ambiguous. In the DCB, for example, 'Mary' might be a person's name, the name of a ship or the Virgin Mary; 'Champlain' might be the person or any number of geographical features named for him. To some extent these can be disambiguated by clever use of context: 'the Mary' is probably a ship, 'Lake Champlain' is a place (although in a phrase like 'the lake Champlain visited' the word 'Champlain' refers to the person), and so on.
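To give the flavour of what "clever use of context" means in practice, here is a toy version of that kind of rule. The rules and test strings are my own illustration and are far cruder than anything a real information extraction system would use:

```python
import re

def classify_mention(text, word):
    """Guess the referent of an ambiguous proper noun from local context.
    Toy rules for illustration only."""
    if re.search(r'\bthe\s+' + word + r'\b', text):
        return 'ship'    # 'the Mary' is probably a ship
    if re.search(r'\bLake\s+' + word + r'\b', text):
        return 'place'   # 'Lake Champlain' is a place
    return 'person'      # default to a personal name

print(classify_mention('boarded the Mary at dawn', 'Mary'))       # ship
print(classify_mention('visited Lake Champlain', 'Champlain'))    # place
print(classify_mention('the lake Champlain visited', 'Champlain'))  # person
```

Note that the last case works only by accident of capitalization; phrases like 'the lake Champlain visited' are exactly where simple rules break down and human markup earns its keep.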

In order to make information explicit in electronic texts, human analysts can add a layer of markup. These tags can then be used to automate processing. I've recently begun a project to tag names and dates in Volume 1 of the DCB using the Text Encoding Initiative XML-based standard TEI Lite. These tags explicitly disambiguate different uses of the same word:

his wife <name type="person" reg="Abraham, Mary">Mary</name>
boarded the <name type="ship">Mary</name>

the lake <name type="person" key="34237" reg="Champlain, Samuel de">Champlain</name> visited
visited <name type="place">Lake Champlain</name>

Tags can also be used to add information that is clear to the reader but would be missed during machine processing. When the biography of John Abraham refers to his wife Mary, the person marking up the text can add the information that the person meant is "Abraham, Mary" and not, say, "Silver, Mary." In the case of someone like Champlain who has a biography in the DCB, the person's unique identifier can also be added to the tag. The information that is added in tags can be particularly valuable when marking up dates, as shown below.

<date value="1690">1690</date>
<date value="1690-09">September of the same year</date>
<date value="1690-09-12">twelfth of the month</date>
<date value="1690-06/1690-08" certainty="approx">summer of 1690</date>
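Once dates carry machine-readable values like these, pulling them out is trivial with Python's standard library. A minimal sketch (the sample string is my own invention, not an actual DCB entry):

```python
import xml.etree.ElementTree as ET

fragment = ('<p>They wintered there in the '
            '<date value="1690-06/1690-08" certainty="approx">summer of 1690</date> '
            'and left on the <date value="1690-09-12">twelfth of the month</date>.</p>')

root = ET.fromstring(fragment)
for date in root.iter('date'):
    # The regularized value lives in the attribute, not the visible text
    print(date.get('value'), date.get('certainty'), repr(date.text))
```

The point is that a program sees '1690-09-12' even though the reader sees only 'twelfth of the month'.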

In a later pass, my research assistants and I will add latitude and longitude to place name tags. For now, we are concentrating on clarifying dates and disambiguating proper nouns. So we are tagging the names of people ('Champlain'), places ('Lake Champlain'), ships ('the Diligence'), events ('third Anglo-Dutch War'), institutions ('Hudson's Bay Company'), ethnonyms ('the French') and others.

Given texts marked up this way, the next step is to write programs that can make use of the tags. In the Python Cookbook, Paul Prescod writes

Python and XML are perfect complements. XML is an open standards way of exchanging information. Python is an open source language that processes the information. Python excels at text processing and at handling complicated data structures. XML is text based and is, above all, a way of exchanging complicated data structures.
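As a small taste of that complementarity, here is a sketch that pulls every regularized person name out of a marked-up passage. The sample string is mine, not an actual DCB entry:

```python
import xml.etree.ElementTree as ET

passage = ('<p>his wife <name type="person" reg="Abraham, Mary">Mary</name> '
           'boarded the <name type="ship">Mary</name> and '
           'visited <name type="place">Lake Champlain</name>.</p>')

root = ET.fromstring(passage)
# Keep only <name> tags typed as persons, and read the regularized form
people = [n.get('reg') for n in root.iter('name') if n.get('type') == 'person']
print(people)  # ['Abraham, Mary']
```

The two ships and places drop out automatically; the ambiguity that made text mining hard has been resolved once, by hand, and can now be exploited any number of times by machine.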

In future posts, I will introduce some of the Python code that we are using to process the marked-up DCB entries. In the meantime I can suggest a few of the many different kinds of questions that can be answered with these texts:

  • Are discussions of particular ethnic groups limited to ranges of time? Do Basques, for example, play a walk-on part in the early cod fishery only to more-or-less disappear from the story of Canadian history after that?

  • If you start with a particular person and link to all of the people mentioned in his or her biography, and then link to all of the people mentioned in theirs, do you eventually connect to everyone in Volume 1? In other words, is it a small world?

  • If you start with a particular place and time (say Trois-Rivières in 1660) and search for all of the events that happened in the preceding decade within a 50km radius, are they related? If so, how?
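The second question, for example, reduces to a graph traversal once the name tags are in place. A sketch with a made-up link structure (real input would be the person-to-person links harvested from the tagged biographies):

```python
from collections import deque

# Hypothetical adjacency list: each person maps to the people
# mentioned in his or her biography.
mentions = {
    'Champlain': ['Membertou', 'Pont-Grave'],
    'Membertou': ['Champlain'],
    'Pont-Grave': ['Champlain', 'Chauvin'],
    'Chauvin': [],
}

def reachable(start, graph):
    """Breadth-first search: everyone connected to start by mention links."""
    seen, queue = {start}, deque([start])
    while queue:
        person = queue.popleft()
        for other in graph.get(person, []):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen

print(sorted(reachable('Champlain', mentions)))
# ['Champlain', 'Chauvin', 'Membertou', 'Pont-Grave']
```

If `reachable` from any starting person returns (nearly) the whole of Volume 1, the answer to the small-world question is yes.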

The classicist and digital humanist Gregory Crane has recently written that "Already the books in a digital library are beginning to read one another and to confer among themselves before creating a new synthetic document for review by their human readers" [What Do You Do with a Million Books?]. This magic is accomplished, in part, by markup. If the system knows which 'Mary' is meant in a particular text it is quite easy to provide links to the same person (or saint, or ship) in other documents in the same digital collection. At the moment we are adding these links by hand, but it is easy to imagine building a system that uses text mining to assign preliminary tags, allows a human analyst to provide correction, then uses that feedback to learn. The Gamera project already provides a framework like this for doing OCR on texts of historical interest.


Sunday, October 15, 2006

Behind the Scenes of a Digital History Site

In a thoughtful post about doing digital history, Josh Greenberg wrote

On an abstract level, I think that there’s a tension between making tools and using tools that comes from a deeper question of audience. When you’re using a tool (or hacking a tool that someone else has already built), there’s a singleminded focus on your own purpose – there’s an end that you want to achieve, and you reach for whatever’s at hand that will (sometimes with a little adjustment) help you get there. When trying to build a tool, on the other hand, there’s a fundamental shift in orientation – rather than only thinking about your own intentions, you have to think about your users and anticipate their needs and desires.

As Josh noted, I've tended to focus on using tools and hacking them in this blog. I haven't been particularly concerned to provide an overall theory of digital history, or even enough background that I could assume that each post would be accessible to everyone in the same way. (I guess the style reflects my own history with Lisp/Scheme and the Unix toolbox). For his part, Josh has been helping to build Zotero, a tool that shows that his concern with the needs and desires of users isn't misplaced.

At a still different level, there is the work that goes into making and maintaining a great digital history site. Dan Cohen and Roy Rosenzweig's book Digital History is an excellent introduction to this part of the field, as is the work that Brian Downey has been doing this year. Brian is the webmaster of the American Civil War site Antietam on the Web. AOTW has all kinds of nice features: an about page that explains their stance on copyright, the site's Creative Commons license and the privacy implications of the site monitoring that they do; an overview of the battle of Antietam with beautiful maps; a timeline that uses the SIMILE API; a database of participants in the battle; transcripts of official reports; a gallery of images; and dozens of other neat things.

At 10 years of age, AOTW is an obvious labor of love and a source of ideas and inspiration. Since March of this year, however, Brian has also been blogging at behind AOTW, "the backwash of a digital history project". The combination of the AOTW site and Brian's blog provides the student of digital history with an unparalleled view behind the scenes of a successful project. In March, for example, Brian posted about footnotes in online history, allowing the reader to compare his code with the implementation on the AOTW site. In another post that month, he discussed copyright and the public domain, something that he has a more-than-academic interest in. In April he laid out a top-down strategy for practicing digital history, continued in June. In July, he discussed the question of whether a site should host advertisements in "Pimping the History Web?" and reviewed some 19th-century online works from the Perseus Project. In August, he implemented a timeline widget and gazetteer for AOTW. This month he has a series of great posts to help someone get started without "an IT shop or a CHNM": tools for putting history online, PHP+database+webserver and jumping in with both feet.


Thursday, October 12, 2006

Searching for History

In August 2006, AOL released three months' worth of search data for more than half a million of their users, each represented by a random ID number. Within days, the company realized that this was a mistake, withdrew the data and made a public apology. (If you missed the story you can find background information and news articles here.) Many people created copies of the dataset before it was withdrawn and it is still available for download at various mirror sites on the web. Part of the uproar was due to the fact that people had used information like credit card and social security numbers in their searches; in one well-publicized case, a woman was actually identified by the content of her searches.

The AOL researchers intended the data to be used for research purposes, and, in fact, it contains a wealth of information about everyday historical consciousness that is useful for public historians. With the proper tools, the AOL search data can be easily mined to discover what kinds of historical topics people are interested in and how they go about trying to find them. We can then use that information to shape the architecture of our online sites. The results presented below were generated in a couple of hours with off-the-shelf tools.

The AOL data are distributed as a compressed archive which uncompresses to 10 text files totalling about 2.12 GB. I used a program called TextPipe Pro to extract all of the searches with 'history' in them. I then loaded these into Concordance, another commercial program, to do the text analysis. True to its name, Concordance lets you create concordances and tables of collocations. (Readers of this blog will know that both of these tasks could be easily accomplished with a programming language like Python, but I wanted to show that you don't have to be able to program to do simple data mining.) The process of extracting the searches and creating a concordance for the 57,291 tokens of 'history' was very fast. It took less than five minutes on a not-very-expensive desktop computer running Win XP.
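For readers who do want to roll their own, a keyword-in-context concordance is only a few lines of Python. A sketch, with an invented mini-corpus of searches standing in for the AOL data:

```python
def concordance(lines, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for line in lines:
        words = line.lower().split()
        for i, w in enumerate(words):
            if w == keyword:
                hits.append((words[max(0, i - width):i],
                             w,
                             words[i + 1:i + 1 + width]))
    return hits

searches = ['history of the united states',
            'clear my search history',
            'american history timeline']
for left, kw, right in concordance(searches, 'history'):
    print(' '.join(left).rjust(25), kw, ' '.join(right))
```

Sorting the hits by left or right context gives the standard KWIC display that concordance software produces.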

Given a concordance, we are in a position to explore what kinds of searches include the word 'history'. For example, suppose someone is interested in US History. They could frame their search in many ways: 'American history', 'history of the United States', and so on. If you are trying to reach users with an online history site, you want to know what kinds of searches they are going to use to get to you. The table below shows the various possibilities that were used by AOL searchers more than fifty times. (Note that I don't include searches for individual states, that the phrase 'American history' is a substring of other phrases like 'African American history' and 'Latin American history', and that the concordance program allows us to search for collocations separated by intervening words.)

american history 998
us history 379
american X history 99
history X american 92
united X history 85
states history 83
history X X america 78
us X history 67
american X X history 63
america history 62
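Gapped patterns like 'american X history' in the table can be tallied with a sketch along these lines (the mini-corpus is invented; the real counts above come from the AOL data):

```python
def pattern_counts(searches, first, last, gap):
    """Count searches where first and last occur with exactly gap
    intervening words, e.g. gap=1 matches 'american X history'."""
    count = 0
    for s in searches:
        words = s.split()
        for i in range(len(words) - gap - 1):
            if words[i] == first and words[i + gap + 1] == last:
                count += 1
    return count

searches = ['american civil war history',
            'american history',
            'american indian history']
print(pattern_counts(searches, 'american', 'history', 1))  # 1: 'american indian history'
print(pattern_counts(searches, 'american', 'history', 2))  # 1: 'american civil war history'
```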

These data seem to indicate a fairly strong preference for the adjectival form. People, in other words, prefer to think of the subject as American or US History rather than the History of the US or of America. The AOL data provide stronger evidence for this search than for most others, but the pattern appears in other regional or national contexts. For example, 'european history' (67) vs. 'history of europe' (3), 'chinese history' (32) vs. 'history of china' (17). More work would obviously be needed to make any kind of strong claim. And some thematic subjects show the opposite pattern, e.g., 'technology history' (4) vs. 'history of technology' (10).

Digging in the search data reveals some unexpected patterns. Some people search for historical topics using a possessive like 'alaska's history' (11), 'canada's history' (5), or 'china's history' (2). When I was adding meta tags to our History Department website, this one never occurred to me but it makes sense in retrospect. If you sort the data by the right context (the stuff that appears after the word 'history') you also find that many people are trying to use dates to limit their searches in a way that most search engines don't allow.

england history 1350 to 1850
french history 1400's
world history 1500 through 1850
world history 1500-1750
women history 1620-1776
italian history 1750's
ancient history 1735 bc
russian history 1880-1900
texas history 1890s law
texas history 1900s news
salvadoran history 1980s
east harlem history 19th century

Unfortunately, searching for '1400's' won't yield dates in the range 1400-1499; it will merely match the literal string '1400's'. Likewise, searching for '1350 to 1850' will only return pages that contain '1350' or '1850'. Searching for '19th century' will give better results but still miss many relevant documents. I hope that the companies working on search engines have noticed that people want to do these kinds of searches; supporting date ranges would make the web much more useful for historical research.
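A search engine that wanted to honour these queries would first have to recognize the date expressions. A sketch of that parsing step, with patterns chosen to cover the examples above and nothing more:

```python
import re

def parse_date_range(query):
    """Extract a (start, end) year range from a search query, if any.
    Handles forms like '1350 to 1850', '1500-1750', "1400's", '1890s'."""
    # Explicit ranges: '1350 to 1850', '1500 through 1850', '1880-1900'
    m = re.search(r'\b(\d{4})\s*(?:to|through|-)\s*(\d{4})\b', query)
    if m:
        return int(m.group(1)), int(m.group(2))
    # Decades and centuries: "1400's", '1890s'
    m = re.search(r"\b(\d{4})(?:'s|s)\b", query)
    if m:
        start = int(m.group(1))
        span = 100 if start % 100 == 0 else 10
        return start, start + span - 1
    return None

print(parse_date_range('england history 1350 to 1850'))  # (1350, 1850)
print(parse_date_range("french history 1400's"))         # (1400, 1499)
print(parse_date_range('texas history 1890s law'))       # (1890, 1899)
```

The hard part, of course, is not the parsing but indexing documents by the dates they discuss so the ranges can actually be matched.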

The prepositional form really comes into its own for more idiosyncratic searches. Apparently people want to know the histories of

1892 carlsbad austria china teapots
a and w root beer
alfredo sauce
banoffee pie
bingham hill cemetery
blood gangs and hand shakes
celtic body art
coleus plant
dental hygiene in america
do rags
easter egg hunt
emlen physick
everything video game releated
family feud
fat tuesday
girls sweet sixteen birthdays
gorzkie zale
half pipe in snowboarding
hex nuts
impala ss
irrational numbers
jang bo go
k9 german sheppards
l'eggs hosiery
laminated dough
macho man
motion offense in basketball
national pi day
paper marbling
quad rugby
resident evil
residential wiring systems
shaving legs
toilet paper
tv dinners
using wine
v8 juice
vikings ruins in okla
watts towers
yankee doodle dandy
zombie movies

There's no adjectival form for these. The history of sport may be interesting to sport historians, but to whom is the history of hex nuts interesting? More people than you'd think.

Finally, we can also see the AOL users' concern with privacy in the data. The concordance software allows us to see which words appear most frequently 1 position to the left of 'history' and 1 position to the right, 2 positions to the left and 2 positions to the right, and so on. The left collocations are most informative in this case. We find, for example, that 'clear' is the third most frequently appearing word 1 position to the left of 'history'. Fourteen hundred and thirteen different searches included the phrase 'clear history', 563 the phrase 'delete history' and 208 'erase history'. If we look 2 positions to the left, we find more searches with similar intent: 'clear X history' (510), 'delete X history' (353) and 'erase X history' (140). Ironically, many of these searches also include 'AOL' as a collocate, e.g., 'how do i delete my history on aol' or 'remove my history from aol'. The table below summarizes these collocations.

3 left        2 left        1 left
              my      199   my      1836   history
clear   123   clear   510   clear   1413   history
aol      41   aol     238   aol      668   history
delete   71   delete  353   delete   563   history
                            search   382   history
              erase   140   erase    208   history
                            browser  132   history
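Counts like those in the table can be generated with a short positional tally. A sketch with an invented mini-corpus:

```python
from collections import Counter

def left_collocates(searches, node, position):
    """Count words appearing exactly position places to the left of node."""
    counts = Counter()
    for s in searches:
        words = s.split()
        for i, w in enumerate(words):
            if w == node and i - position >= 0:
                counts[words[i - position]] += 1
    return counts

searches = ['clear my search history',
            'how do i delete my history on aol',
            'clear history']
counts = left_collocates(searches, 'history', 1)
print(sorted(counts.items()))  # [('clear', 1), ('my', 1), ('search', 1)]
```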

Of course, this is merely the tip of the iceberg. Much more could be found by studying, for example, how people use dates in searches or what kinds of things they are looking for when they visit particular websites.


Tuesday, October 10, 2006

Tapera-DHH Survey of History Blogs

I've been working with Nicolás Quiroga of Tapera on a survey of history blogs. (To be fair, Nicolás has actually been doing most of the work.) Anyway, he has created a number of graphs of the preliminary results and posted them to his blog [1, 2, 3]. We also have a wiki page for the project on the Western Digital History server. The wiki is set up so that anyone can read it, but you need an account to edit it. I'm happy to provide access to the wiki to other digital historians who are interested in playing with or extending the results. If you'd like to participate in the ongoing blog survey, please mail your answers to the questions to Nicolás at

Friday, October 06, 2006

Zotero Beta Launched

In previous posts I've discussed the great new research tool Zotero [1, 2]. The public beta of the software launched yesterday, with a new website, a blog, user forums and greatly extended documentation including a wiki for developers. Zotero's creators have been busy in the few weeks since I reviewed the pre-release beta. They've added support for reusing tags, made it easier to add notes to saved sources and added a bunch of new fields to the bibliographic records. As before, the interface is clean and quite intuitive and the program works smoothly when you need it and doesn't get in your way when you don't. It's a beautiful piece of work.

Something I hadn't noticed before: Zotero uses the OpenURL framework to provide support for context-sensitive services. This means that you can tell the program to locate a source that you are interested in, and it will look for it in your local library.

The feature list gives you some idea of where Zotero is going (and where you can help take it). Planned features include shared collections, remote library backup, advanced search and data mining tools, a recommendation engine with RSS feeds and word processor integration. Zotero is already much more than bibliographic management software. It is a "platform for new forms of digital research that can be extended with other web tools and services." And it rocks.


Tuesday, October 03, 2006

On N-gram Data and Automated Plagiarism Checking

In August, Google announced that they would be releasing a massive amount of n-gram data at minimal cost (see "All Our N-gram are Belong to You").

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together.

In brief, an n-gram is simply a collocation of words that is n items long. "In brief" is a bigram, "a collocation of words" is a 4-gram, and so on. For more information, see my earlier post on "Google as Corpus."
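Generating n-grams is a one-liner in most languages; in Python:

```python
def ngrams(text, n):
    """Return all n-word collocations in a text, in order."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams('in brief an n-gram is simply a collocation of words', 2)[:3])
# ['in brief', 'brief an', 'an n-gram']
```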

The happy day is here. For US $150 you can order the six DVD set of Google n-gram data from the Linguistic Data Consortium. While waiting for my copy to arrive, I figured that I could take this opportunity to suggest that the widespread availability of such data is going to force us to rethink the idea of plagiarism, especially the idea that plagiarism can be detected in a mechanical fashion.

My school, for example, subscribes to a service called Turnitin. On their website, Turnitin claims that their software "Instantly identifies papers containing unoriginal material." That's a pretty catchy phrase. So catchy, in fact, that it appears, mostly unquoted, in 338 different places on the web, usually in association with the Turnitin product, but also occasionally to describe their competitors like MyDropBox.

In the old days, say 2001, educators occasionally used Google to try and catch suspected plagiarizers. They would find a phrase that sounded anomalous in the student's written work and type it into Google to see if they could find an alternate source. I haven't heard anyone claim to have done that recently, for a pretty simple reason. Google now indexes too much text to make this a useful strategy.

Compared with Google, Turnitin is a mewling and puking infant (N.B. allusion, not plagiarism). At best, the company can only hope for the kind of comprehensive text archive that massive search engines have already indexed. With this increase in scale, however, comes a kind of chilling effect. Imagine if your word processor warned you whenever you tried to type a phrase that someone else had already thought of. You would never write again. (Dang! That sentence has already been used 343 times. And I know that I read an essay by someone on exactly this point, but for the life of me I can't locate it to cite it.)

What Google's n-gram data will show is that it is exceedingly difficult to write a passage that doesn't include a previously-used n-gram. To demonstrate this, I wrote a short Python script that breaks a passage of text into 5-grams and submits each, in turn, to Google to make sure that it doesn't already appear somewhere on the internet.
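The script amounted to little more than the following. The web lookup itself is omitted here (any HTTP fetch of the quoted-phrase query will do); what matters is the decomposition into overlapping 5-grams:

```python
import re

def five_gram_queries(passage):
    """Break a passage into quoted 5-gram phrase queries, stripping
    punctuation the way a search engine would ignore it."""
    words = re.findall(r"[a-z0-9']+", passage.lower())
    return ['"%s"' % ' '.join(words[i:i + 5])
            for i in range(len(words) - 4)]

statement = ('Students must write their essays and assignments '
             'in their own words.')
for query in five_gram_queries(statement):
    print(query)
# first query: "students must write their essays"
```

Submitting each query and recording the hit count produces tables like the one below.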

My university's Handbook of Academic and Scholarship Policy includes the following statement, which provides a handy test case.

NOTE: The following statement on Plagiarism should be added to course outlines:
“Plagiarism: Students must write their essays and assignments in their own words. Whenever students take an idea, or a passage from another author, they must acknowledge their debt both by using quotation marks where appropriate and by proper referencing such as footnotes or citations. Plagiarism is a major academic offence (see Scholastic Offence Policy in the Western Academic Calendar).”

Here are the number of times that various 5-grams in this statement have been used on the web, sorted by frequency:

5740 "should be added to course"
1530 "idea or a passage from"
1480 "assignments in their own words"
1400 "where appropriate and by proper"
1380 "or a passage from another"
1270 "an idea or a passage"
1120 "and assignments in their own"
0923 "plagiarism is a major academic"
0774 "a passage from another author"
0769 "essays and assignments in their"
0704 "students must write their essays"
0635 "they must acknowledge their debt"
0628 "must write their essays and"
0619 "write their essays and assignments"
0619 "marks where appropriate and by"
0606 "acknowledge their debt both by"
0605 "is a major academic offence"
0596 "both by using quotation marks"
0595 "appropriate and by proper referencing"
0588 "policy in the western academic"
0585 "and by proper referencing such"
0585 "referencing such as footnotes or"
0585 "scholastic offence policy in the"
0583 "must acknowledge their debt both"
0579 "by using quotation marks where"
0573 "such as footnotes or citations"
0572 "proper referencing such as footnotes"
0570 "using quotation marks where appropriate"
0561 "their debt both by using"
0553 "take an idea or a"
0549 "debt both by using quotation"
0549 "in the western academic calendar"
0548 "see scholastic offence policy in"
0546 "offence policy in the western"
0544 "quotation marks where appropriate and"
0503 "by proper referencing such as"
0492 "their essays and assignments in"
0490 "note the following statement on"
0479 "in their own words whenever"
0453 "whenever students take an idea"
0452 "from another author they must"
0442 "students take an idea or"
0432 "another author they must acknowledge"
0389 "citations plagiarism is a major"
0385 "their own words whenever students"
0377 "passage from another author they"
0373 "own words whenever students take"
0368 "or citations plagiarism is a"
0366 "footnotes or citations plagiarism is"
0366 "a major academic offence see"
0355 "as footnotes or citations plagiarism"
0353 "the following statement on plagiarism"
0348 "major academic offence see scholastic"
0338 "offence see scholastic offence policy"
0333 "academic offence see scholastic offence"
0179 "plagiarism students must write their"
0096 "plagiarism should be added to"
0066 "following statement on plagiarism should"
0062 "be added to course outlines"
0033 "statement on plagiarism should be"
0030 "on plagiarism should be added"

Beyond the mechanical, there are a lot of murky conceptual problems with plagiarism. To claim that the core value of scholarship has always been to respect the property rights of the individual author is wildly anachronistic. (For a more nuanced view, see Anthony Grafton's Forgers and Critics and Defenders of the Text.) A simpleminded notion of plagiarism also makes it difficult to explain any number of phenomena we find in the actual (as opposed to normative) world of text: Shakespeare, legal boilerplate, folktales, oral tradition, literary allusions, urgent e-mails about Nigerian banking opportunities and phrases like "all our n-gram are belong to you."

In a 2003 article in the AHR, Roy Rosenzweig wrote about the difficulties that historians and other scholars will face as they move from a culture of scarcity to one of abundance. In many ways, this transition has already occurred. It's time to stop pretending that prose must always be unique, or that n-grams can be property. All your prose are belong to us.
