Thursday, August 10, 2006

Google as Corpus

Last year, the Economist noted that corpus linguists are increasingly turning to the world wide web as a model of what people actually say ("Corpus colossal," 20 Jan 2005). There are complications, to be sure, many of them caused by the never-ending war on spam. To a good first approximation, however, it is possible to mine the web for examples of common and not-so-common phrasing. Many of these techniques have applications for digital history.

Consider the problem of trying to find a good keyword when searching through sources, or when indexing a compilation of documents. How can we program a computer to figure out what a document is about? We might start with a measure of how frequently particular words occur. (For this we can use one of the TaPOR tools.) Suppose we find that document D contains multiple occurrences of words like 'state' (24 times), 'water' (23), 'Chinese' (17), 'San Francisco' (10), 'steamer' (6) and 'telegraph' (3). The document could be about anything, but a historian might reasonably guess that it was about California in the 19th century.
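For readers who would rather count words themselves than use a hosted tool, here is a minimal sketch in Python. The function name term_frequencies and the sample sentence are made up for illustration, and the tokenizer is deliberately crude: it counts single words only, so a two-word name like 'San Francisco' would need extra handling.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercased word occurs in text."""
    # Treat runs of letters (with internal apostrophes or hyphens) as words.
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return Counter(words)

# Illustrative sentence, not taken from document D.
sample = "The steamer left the wharf; the water was high and the steamer was slow."
counts = term_frequencies(sample)
print(counts.most_common(3))
```

Even in this tiny sample, 'the' comes out on top, which is the problem the next paragraph takes up.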

Frequency is not enough by itself. Document D also contains multiple copies of 'the' (313), 'of' (145), 'and' (144), 'in' (89) and many other words that do little to distinguish it from other documents written in English. As a result, specialists in information retrieval have created a measure called tf-idf, which weights the frequency of a given term by its relative infrequency in the corpus as a whole. Here's where Google comes in. When you search for a term, the search engine responds with the number of pages that contain that term. A couple of years ago, Philipp Lenssen used the Google API to determine the frequency of the 27,693 most used words. At the time, the word 'the' appeared in 522 million pages. The web has grown since then: three years later 'the' appears in about 42 times as many pages.
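One common form of the measure is tf-idf(t) = tf(t) × log(N / df(t)), where tf is the number of times the term occurs in the document, df is the number of documents in the corpus that contain it, and N is the size of the corpus. The sketch below applies that form to a three-document toy corpus invented for the purpose. With Google as the corpus, df would be the reported page count and N an estimate of the total number of indexed pages, which is an assumption you would have to supply.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each term by raw count times log(N / document frequency)."""
    n_docs = len(corpus)
    scores = {}
    for term, count in Counter(doc_tokens).items():
        # Number of corpus documents containing the term at least once.
        # Assumes doc_tokens is itself in the corpus, so this is never zero.
        doc_freq = sum(1 for tokens in corpus if term in tokens)
        scores[term] = count * math.log(n_docs / doc_freq)
    return scores

# Toy corpus, made up for illustration.
corpus = [
    "the steamer left the wharf in the mud".split(),
    "the telegraph office in the city".split(),
    "the price of wheat in the state".split(),
]
scores = tf_idf(corpus[0], corpus)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s}{score:.2f}")
```

Words that occur in every document, like 'the' and 'in', score zero no matter how often they appear, while words found only in the first document rise to the top.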

If we look for words in document D that have a relatively high tf-idf (using Google as our corpus), we can come up with a set of potential keywords, shown in the table below. Note that even the least frequent of these occurs in more than half a million documents on the web.


When we use the keywords together, however, we can pick our document D out of the billions that Google indexes. The table below shows the number of hits for each successively narrower search.

Search terms                                   Hits
chinese mud                                    5,830,000
chinese mud steamer                            141,000
chinese mud steamer Tuolumne                   621
chinese mud steamer Tuolumne Alviso            70
chinese mud steamer Tuolumne Alviso Ashburner  1

Used in conjunction with the Google Search API, in other words, a metric like tf-idf can automatically find the set of keywords which will return a given document. We can then find similar documents by relaxing the search a little bit and seeing what else turns up.
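The narrowing procedure can be sketched without touching a real search service. In the Python below, hit_count is a hypothetical stand-in for a search API call: it simply counts how many documents in a small invented corpus contain every query term. The function names and the corpus are assumptions made for illustration, not part of any Google interface.

```python
def hit_count(terms, corpus):
    """Hypothetical stand-in for a search engine: documents containing every term."""
    return sum(1 for doc in corpus if all(t in doc for t in terms))

def narrow_until_unique(ranked_terms, corpus):
    """Add keywords one at a time, recording hits, until at most one document matches."""
    query, history = [], []
    for term in ranked_terms:
        query.append(term)
        hits = hit_count(query, corpus)
        history.append((" ".join(query), hits))
        if hits <= 1:
            break
    return history

# Invented corpus: each document is reduced to its set of words.
corpus = [
    {"chinese", "mud", "steamer", "alviso"},
    {"chinese", "mud", "steamer"},
    {"chinese", "mud"},
    {"chinese", "telegraph"},
]
for query, hits in narrow_until_unique(["chinese", "mud", "steamer", "alviso"], corpus):
    print(query, hits)
```

Relaxing the search is the same loop run backwards: drop the last keyword or two from the query and look at what else matches.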

Besides frequency, we can also consider the context that a given keyword appears in. One way to do this is by simply searching for seemingly characteristic phrases. "Blunts the keener feelings," for example, is apparently unique to document D, as is "muddy, cheerless and dull beyond telling." There are some specialized tools, like the Linguist's Search Engine, which allow the user to search for grammatical structures. You can tell it that you want to find constructions like "the bigger the house the richer the buyer" and it will return with things like "the darker the coffee bean, then the less caffeine." (See the 2005 paper by Resnik, Elkiss, Lau and Taylor for more info.)

Until now Google has kept most of the raw data to themselves, generously providing an API for researchers to access some of it. On August 3rd, however, they announced that they would be making more than a billion 5-grams available to researchers. An n-gram is a sequence of tokens, words in this case, that you get by sliding a window across text. If S is the list of lowercased words in the previous sentence, with the punctuation stripped, then you can generate all of the 5-grams with a one-line Python script

# S is a list of word tokens; each slice S[i:i+5] is one 5-gram
print([S[i:i+5] for i in range(len(S) - 4)])

which returns

[['an', 'n-gram', 'is', 'a', 'sequence'],
['n-gram', 'is', 'a', 'sequence', 'of'],
['is', 'a', 'sequence', 'of', 'tokens'],
['a', 'sequence', 'of', 'tokens', 'words'],
['sequence', 'of', 'tokens', 'words', 'in'],
['of', 'tokens', 'words', 'in', 'this'],
['tokens', 'words', 'in', 'this', 'case'],
['words', 'in', 'this', 'case', 'that'],
['in', 'this', 'case', 'that', 'you'],
['this', 'case', 'that', 'you', 'get'],
['case', 'that', 'you', 'get', 'by'],
['that', 'you', 'get', 'by', 'sliding'],
['you', 'get', 'by', 'sliding', 'a'],
['get', 'by', 'sliding', 'a', 'window'],
['by', 'sliding', 'a', 'window', 'across'],
['sliding', 'a', 'window', 'across', 'text']]

The Google research team did something similar on more than a trillion words from the web, and kept every 5-gram that occurred at least forty times. (Which means that "cheerless and dull beyond telling" won't be in the data set, since it occurs only once, in document D.)
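On a small scale, counting n-grams and discarding the rare ones looks something like the sketch below. The function name frequent_ngrams and its min_count parameter are choices made here for illustration, and the sample text and cutoff are toy values. This is not a description of how the Google team processed their data.

```python
from collections import Counter

def frequent_ngrams(tokens, n, min_count):
    """Return the n-grams (as tuples) that occur at least min_count times."""
    # Slide a window of width n across the token list.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy input: bigrams with a cutoff of 2 rather than 5-grams over the web.
tokens = "the bigger the house the richer the buyer the bigger the house".split()
print(frequent_ngrams(tokens, 2, 2))
```

Any phrase that falls below the cutoff simply disappears from the result, which is why a one-off phrase cannot be recovered from a thresholded data set.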

So what can you do with n-grams? Lots. Daniel Tauritz, a computer scientist at the University of Missouri-Rolla, keeps a clearinghouse of n-gram research. A partial list of applications includes "text compression (1953), spelling error detection and correction (1962), optical character recognition (1967), information retrieval (1973), textual representation (1979), language identification (1991), missing phoneme guessing (1992), information filtering (1993), automatic text categorization (1994), music representation (1998), spoken document retrieval (2000), computational immunology (2000) and medical record matching (2001)." Many of these techniques have clear applications or analogs in historical research.
