Saturday, September 09, 2006

What We Need Now Is a Good Trolling Engine...

One thing that is difficult to do with a traditional search engine is find documents that were written at a particular time (the new Google News Archive Search being a notable exception). Suppose, for example, that you are starting a research project on the environmental history of nineteenth-century gold rushes in North America. Are there good collections of online primary sources that you should know about? Of course, but they can be hard to find. It would be great to be able to limit your Google searches to documents written during particular date ranges, e.g., 1848-55 (for the California gold rush), 1858-65 (Cariboo) or 1896-99 (Klondike).

This turns out to be more difficult than you might think at first. Google Advanced Book Search allows you to specify a publication date range. So a search for "gold california date:1848-1855" returns books like Walter Colton's Three Years in California (1850), which you can actually download as a PDF. But other books are not going to show up, like A Doctor's Gold Rush Journey to California by Israel S. Lord, which was written from 1849 to 1851 but not published until 1995. In cases like these, you are searching through metadata rather than through the document itself. Most of the material on the web doesn't have enough metadata to be really satisfactory for this kind of searching.

Furthermore, depending on the project you may not always have good search terms. Suppose you are thinking of becoming a digital medievalist and want to get some idea of what kinds of sources you might be able to work with. How do you search for machine-readable documents written in Old English? Obviously you will try to make use of the traditional scholarly apparatus and of online resource guides like The ORB.

To supplement this kind of activity, I'm thinking it would be very nice to have what I'm going to call a "trolling engine," a tool that can sift through the Internet on a more-or-less continuous basis and return items that match a particular set of criteria determined by a human analyst. You would set it up, say, to look for documents written during the Cariboo gold rush, or written in Old English around the time of King Alfred, or those that may have been written by ornithologists in the West Midlands in the 1950s (if you're interested in the latter, you're in luck).

So how would a trolling engine work? In present-day search engines, spiders scour the web downloading pages. A massive inverted index is created so that there is a link running from each term on every page back to the page itself. Once this blog post is indexed by Google's spiders, for example, there will be links to it in their inverted index from "trolling," "engine," "spiders" and many other terms. The catch is that there is not a lot of other publicly accessible information associated with each term. Suppose, however, that Google also tagged each term with its part of speech and parsed all the text in the surrounding context. Then you would be able to search for items in a particular syntactic frame. As Dan Brian showed in an interesting article, you could search for all instances of "rock" used as an intransitive verb, and find sentences like "John thought San Francisco rocked" without finding ones like "The earthquake rocked San Francisco." There is already a pretty cool program called The Linguist's Search Engine that lets you do this kind of searching over a corpus of about 3.5 million sentences.
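To make the idea of a term-to-page index concrete, here is a minimal sketch in Python. It is a toy, not a description of how Google or any real search engine is built: the page ids, the sample text and the function names (tokenize, build_inverted_index, search) are all my own assumptions, and it stores nothing about part of speech or syntax.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_inverted_index(pages):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical pages, invented for illustration.
pages = {
    "post-1": "What we need now is a good trolling engine",
    "post-2": "Spiders scour the web downloading pages",
    "post-3": "A search engine sends out spiders",
}
index = build_inverted_index(pages)
print(sorted(search(index, "engine")))
print(sorted(search(index, "engine spiders")))
```

A trolling engine would need to hang more information off each entry (a part-of-speech tag, a parse of the surrounding sentence), so that a query could ask for a term in a particular frame rather than just a term.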

In fact, being able to search the whole web for words in particular syntactic frames could be a very powerful historical tool for a simple reason: languages change over time. Take "sort of/kind of." For at least six hundred years, English speakers have been using these word sequences in phrases like "some kind of animal," that is, as a noun followed by a preposition. By the nineteenth century, "sort of" and "kind of" also appeared as degree modifiers: "I kind of think this is neat." In a 1994 Stanford dissertation, Whit Tabor showed that between the 16th and 19th centuries, "sort of" and "kind of" increasingly appeared in syntactic frames where either reading makes sense. That is, "kind of good idea" might be interpreted as [kind [of [good idea]]] or [[[kind of] good] idea]. So if you find a document that uses "sort of" or "kind of" as a degree modifier, you have one clue that it was probably written sometime after 1800. (See the discussion in Manning and Schütze for more on this example.)
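As a very rough illustration of telling the two uses apart, the sketch below looks only at the word immediately before "kind of" or "sort of." This is a crude stand-in for real tagging and parsing, not Tabor's method: the cue list NOUN_CUES and the function classify_kind_of are hypothetical, and the heuristic makes no attempt to handle the ambiguous frames discussed above.

```python
import re

# Hypothetical, deliberately small list of words that tend to precede the
# noun reading ("some kind of animal"). A real system would use a tagger.
NOUN_CUES = {"a", "an", "the", "this", "that", "some", "any", "what",
             "every", "no", "which", "each", "another"}

def classify_kind_of(sentence):
    """Label each 'kind of' / 'sort of' as 'noun' or 'degree modifier'."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    labels = []
    for i in range(len(tokens) - 1):
        if tokens[i] in ("kind", "sort") and tokens[i + 1] == "of":
            # Only the preceding word is inspected; this is the whole heuristic.
            previous = tokens[i - 1] if i > 0 else ""
            reading = "noun" if previous in NOUN_CUES else "degree modifier"
            labels.append((tokens[i] + " of", reading))
    return labels

print(classify_kind_of("It was some kind of animal."))
print(classify_kind_of("I kind of think this is neat."))
```

A document in which the second reading turns up would carry the clue described above, although a one-word window like this will misfire often enough that it should be treated as a hint and nothing more.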

It's not just these two word sequences that have a history. Every word, every collocation has a history. A word like "troll" is attested as a verb in the fourteenth century and as a noun in the seventeenth. Its use as a fishing term also dates from the seventeenth century. If your document is about trolls it was probably written after 1600; if it is about trolling, it could have been written earlier (see my post on "A Search Engine for 17th-Century Documents"). By itself, the earliest attested date of a single word or collocation is weak evidence. If we were to systematically extract this kind of information from a very large corpus of dated documents, however, we could create a composite portrait of documents written in AD 890 or during the Cariboo gold rush or at any other given time.
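Here is a minimal sketch of the weakest version of this idea: using earliest attestations to put a lower bound on when a document could have been written. The table FIRST_ATTESTED is an assumption of mine. Its two entries simply round the centuries mentioned above to their starting years, and the function name earliest_possible_year is hypothetical.

```python
import re

# Illustrative values only: century boundaries taken from the discussion of
# "troll" above, not dates looked up in a dictionary.
FIRST_ATTESTED = {
    "trolling": 1300,  # verb attested in the fourteenth century
    "trolls": 1600,    # noun attested in the seventeenth century
}

def earliest_possible_year(text, first_attested):
    """Return the latest first-attestation year among the words in the text.

    A document cannot easily predate the newest word it uses, so this is a
    rough lower bound. Returns None if no word in the text is in the table.
    """
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    years = [first_attested[t] for t in tokens if t in first_attested]
    return max(years) if years else None

print(earliest_possible_year("He went trolling for pike.", FIRST_ATTESTED))
print(earliest_possible_year("The trolls went trolling.", FIRST_ATTESTED))
```

The composite portrait would go further than this, replacing the single earliest date for each word with how often the word or collocation appears in dated documents from each period.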

A similar logic would help us find documents written by ornithologists. In this case, the training corpus would have to be tagged with a different kind of metadata in addition to the date: the occupation of the author. Once we had that we could discover that two words that appear separately on millions of web pages, "pair" and "nested", occur quite rarely as the collocation "pair nested." That's the kind of thing an ornithologist would write.
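One simple way to look for this kind of signature is to compare how often a collocation turns up in text by one group of authors versus everyone else. The sketch below does that with plain bigram counts. The sentences are invented toy data standing in for a corpus tagged by occupation, and the function names bigram_counts and bigram_rate are my own.

```python
import re
from collections import Counter

def bigram_counts(texts):
    """Count adjacent word pairs across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return counts

def bigram_rate(texts, bigram):
    """Return the fraction of all bigrams in the texts that equal `bigram`."""
    counts = bigram_counts(texts)
    total = sum(counts.values())
    return counts[bigram] / total if total else 0.0

# Invented toy sentences, grouped by a hypothetical author-occupation tag.
corpus = {
    "ornithologist": [
        "A pair nested in the hedge",
        "One pair nested near the canal",
    ],
    "other": [
        "The dolls nested inside each other",
        "She bought a pair of boots",
    ],
}

for occupation, texts in corpus.items():
    print(occupation, bigram_rate(texts, ("pair", "nested")))
```

Both words appear in both groups, but the collocation appears in only one of them, which is the pattern a trolling engine would be looking for.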
