Wednesday, August 30, 2006

Easy Pieces in Python: A Lookup Table

Often you find yourself with data that would be more useful in another form. For example, you might have a list of zip codes that you want to plot with Google Maps. Unfortunately, the Google Maps API only takes latitude and longitude. The US Census Bureau website, however, has an online table of zip codes with cities, states, latitudes and longitudes. With a little bit of Python it is easy to convert your zip codes to latitude and longitude.

A line in the online data table looks like this:

"06","91436","CA","ENCINO",118.488238,34.15098,13605,0.000457


If we split each line apart at the commas, we will end up with eight fields. Since Python counts from zero, this means the zip code is field number 1. The longitude and latitude are fields 4 and 5 respectively. Our script reads in the data file a line at a time, then creates a dictionary so we can look up each zip code and get a string representing the latitude and longitude. (For Google Maps, we want latitude first, and longitude expressed as a negative number.)

datafile = open('zips.txt', 'r')
lookup = {}
for line in datafile.readlines():
    items = line.split(',')
    zip = items[1]                               # the zip code, still wrapped in quotes
    lat = items[5]
    lon = items[4]
    lookup[zip[1:-1]] = "%s, -%s" % (lat, lon)   # strip the quotes; negate the longitude for Google Maps

We now can look up individual zip codes:

print lookup['91436']

returns

'34.15098, -118.488238'


Or we can transform all of our data at once:

ziplist = ['02139', '02141', '02142', ...]
print [lookup[z] for z in ziplist]

returns

['42.364688, -71.104155', '42.370701, -71.088277', '42.362025, -71.083011', ...]
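
One caveat worth adding: if one of your zip codes doesn't appear in the Census table, lookup[z] will raise a KeyError. The dictionary's get method lets you supply a placeholder instead, something like this:

print [lookup.get(z, 'unknown') for z in ziplist]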



Monday, August 28, 2006

Undergraduate Training for Digital History

I suspect that most of my readers are already working in academia or industry. I know of a few, however, who are at the beginning stages of an undergraduate program and are wondering what kind of coursework will provide a solid foundation for historical practice in a digital age. Although digital history (or history and computing or historical informatics) doesn't really exist yet as a separate specialty, many of the necessary techniques are already being taught at most universities and colleges. Furthermore, thanks to an initiative known as OpenCourseWare, it is possible to teach yourself many of these techniques using freely available courseware from MIT, Johns Hopkins, Carnegie Mellon, Tufts and other schools. (For the benefit of students at my own university, I've also included courses taught at Western.)

So what kind of preparation? The short answer is that you want a double major in history and computer science with some carefully chosen math and linguistics courses. Below I've listed some of the coursework that I've found most useful (or wish that I had taken). Hopefully this post will encourage some other working digital historians or digital humanists to share their own suggestions.
  • Foundational
    • Intro linguistics (MIT 24.900, UWO Linguistics 288 a/b)
    • Statistics (CMU Statistics, UWO Statistical Sciences 222 a/b)
    • Intro computer science with emphasis on programming (MIT 6.001, MIT 1.001, UWO Computer Science 025 a/b, 027 a/b)
  • Intermediate
    • More programming especially web programming (MIT 6.171)
    • Discrete mathematics and/or linear and matrix algebra (MIT 18.06, MIT 18.310, MIT 6.042J, UWO Computer Science 210 a/b)
    • Morphology and syntax (UCLA Ling205, MIT 24.902, MIT 24.951, UWO Anthropology 248 a/b)
    • Corpus linguistics (UC Berkeley Ling290A)
    • History of English or whatever language you do your research in (e.g., U Maryland Engl482, UWO Linguistics 250 f/g)
    • Historical/comparative linguistics (UC Santa Cruz Ling140, Stanford Ling160, UWO Anthropology 335 f/g)
    • Databases (MIT 1.264J, MIT 6.830, UWO Computer Science 319 a/b)
    • Artificial intelligence (MIT 6.034, UWO Computer Science 346 a/b)
    • Geographical Information Systems (GIS) (MIT 11.188, UWO Geography 383b)
  • Advanced
  • Other (courses that would be great but are less frequently taught)
Plus, of course, lots of whatever kinds of history courses most interest you.


Updates (30-31 Aug 2006)

Paula Petrik writes "Since I'm cross-trained with an MFA, I'd like to add to your list and advise students to take an art course or two, especially in graphic design or typography. The web is a visual medium in many respects, and academics of necessity must 'compete' in this environment. That's not to say that historians must become graphic designers, but a good, clean design does wonders. And some knowledge of typography improves readability and legibility."

Tom Scheinfeldt writes "I'd like to second Paula's comment about the importance of design and display. History is at least as much about presentation of results as it is about research. By the same token, I'd also encourage undergraduates to take at least one public history course. The web is the most public of media, and students should understand the interests and expectations not only of their scholarly peers, but also of the more general public they will surely encounter as they venture onto the web."


Friday, August 25, 2006

Digital History Interns 2006

The summer is over and the first five digital history interns at Western have completed the projects that they were working on: Jennifer Bonnell developed a place-based computing system for a field trip at the NiCHE summer school; Tim Compeau created the website An American Refugee: Joel Stone, United Empire Loyalist; Joel Ralph wrote a proof-of-principle demonstration for geocoding articles from The Beaver: Canada's History Magazine; Cat Rose worked on a historic driving tour of the Niagara region; and Rebecca Woods began a text mining project and helped to develop a graduate course in digital history. There is more information on the Digital History at Western internships page. I will be continuing the program next summer; please contact me if you are interested in it.


Wednesday, August 23, 2006

Easy Pieces in Python: Removing Stop Words

We continue our exploration of simple Python scripting with another common problem: removing stop words. Something like forty percent of a typical English text consists of very common words such as 'a', 'the' and 'and'. While necessary to convey meaning, these words do little to distinguish one text from another, and so they are usually not very useful for tasks like searching, indexing, choosing good keywords, and so on.

As before, we are working with Charles William Colby's The Fighting Governor: A Chronicle of Frontenac (1915) from Project Gutenberg. We will also be using a list of stop words posted online by computer science researchers at Glasgow. For a specific research project, we would want to tune this list, but it is fine for demo purposes. As before, we read both of our text files into lists of words.

textwords = open('cca0710-trimmed.txt', 'r').read().split()
stopwords = open('stop_words.txt', 'r').read().split()
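
(A small aside: checking membership in a Python list rescans the list for every word, so for a long text it's worth turning the stop word list into a set first. Everything below works the same either way.)

stopwords = set(stopwords)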

Now we have two options. If we want to maintain some semblance of context, we can replace each stop word with a marker. The following code does exactly that.

markeredtext = []
for t in textwords:
    if t.lower() in stopwords:
        markeredtext.append('*')
    else:
        markeredtext.append(t)

Before replacing stop words with markers, a sentence from our file looks like this:

['In', 'short,', 'the', 'divine', 'right', 'of', 'the', 'king', 'to', 'rule', 'over', 'his', 'people', 'was', 'proclaimed', 'as', 'loudly', 'in', 'the', 'colony', 'as', 'in', 'the', 'motherland.']


Afterwards, it looks like this:

['*', 'short,', '*', 'divine', 'right', '*', '*', 'king', '*', 'rule', '*', '*', 'people', '*', 'proclaimed', '*', 'loudly', '*', '*', 'colony', '*', '*', '*', 'motherland.']


For other applications, we may not need to use markers. To simply delete the stop words, we can use the following code instead:

filteredtext = [t for t in textwords if t.lower() not in stopwords]

This is what the same sentence looks like when stop words are deleted:

['short,', 'divine', 'right', 'king', 'rule', 'people', 'proclaimed', 'loudly', 'colony', 'motherland.']



Monday, August 21, 2006

Easy Pieces in Python: Keyword in Context

Yesterday, I showed that it is possible to extract useful information from a historical source (word frequencies) with a few lines of a high-level programming language like Python. Today we continue with another simple demo, keyword in context (KWIC). The basic problem is to split a text into a long list of words, slide a fixed window over the list to find n-grams and then put each n-gram into a dictionary so we can find the contexts for any given word in the text.

As before, we are going to be working with Charles William Colby's The Fighting Governor: A Chronicle of Frontenac (1915) from Project Gutenberg. We start by reading the text file into a long string and then splitting it into a list of words:

wordlist = open('cca0710-trimmed.txt', 'r').read().split()

Next we run a sliding window over the word list to create a list of n-grams. In this case we are going to be using a window of five words, which will give us two words of context on either side of our keyword.

ngrams = [wordlist[i:i+5] for i in range(len(wordlist)-4)]

We then need to put each n-gram into a dictionary, indexed by the middle word. Since we are using 5-grams, and since Python sequences are numbered starting from zero, we want to use 2 for the index.

kwicdict = {}
for n in ngrams:
    if n[2] not in kwicdict:
        kwicdict[n[2]] = [n]
    else:
        kwicdict[n[2]].append(n)
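
As an aside, the same dictionary can be built a bit more compactly with setdefault, which supplies an empty list the first time a keyword turns up:

kwicdict = {}
for n in ngrams:
    kwicdict.setdefault(n[2], []).append(n)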

Finally, we will want to do a bit of formatting so that our results are printed in a way that is easy to read. The code below gets all of the contexts for the keyword 'Iroquois'.

for n in kwicdict['Iroquois']:
    outstring = ' '.join(n[:2]).rjust(20)        # two words of left context, right-justified
    outstring += str(n[2]).center(len(n[2])+6)   # the keyword, padded on both sides
    outstring += ' '.join(n[3:])                 # two words of right context
    print outstring

This gives us the following results.

          bears, and   Iroquois   knew that
              of the   Iroquois   villages. At
            with the   Iroquois   at Cataraqui
              to the   Iroquois   early in
              to the   Iroquois   chiefs, Frontenac
         shelter the   Iroquois   from the
          wished the   Iroquois   to see
              of the   Iroquois   a fort
                 ...
       that captured   Iroquois   were burned

This kind of analysis can be useful for historiographical argumentation. If we look at the contexts in which the Iroquois appear in Colby's text, we find that they are usually the objects of verbs rather than the subjects. That is to say that we find a lot of phrases like "to the Iroquois," "make the Iroquois," "overawe the Iroquois," "invite the Iroquois," "with the Iroquois," "smiting the Iroquois," and so on. We find far fewer phrases of the form "[the] Iroquois knew," "the Iroquois rejoiced," or "six hundred Iroquois invaded." This could be taken to suggest that Colby wasn't thinking of the Iroquois as historical agents (which is how most historians see them now) but rather as background characters, as foils for the settlers of New France.


Sunday, August 20, 2006

Easy Pieces in Python: Word Frequencies

I had originally planned to use Perl with my digital history students but have come to the reluctant conclusion that the language probably isn't ideal for my purposes. Perl has the motto that "there's more than one way to do it," which is fine for experienced programmers but a bit confusing for beginners. So I've made the shift to Python and am very happy so far. When I came across the tutorial on word frequencies in Ruby at Semantic Humanities, I decided it would make a nice demo for Python, too.

The basic problem is to split a text file into a list of words, count the number of occurrences of each word, and sort the results by frequency. For my text, I chose Charles William Colby, The Fighting Governor: A Chronicle of Frontenac (1915), available from Project Gutenberg. We start by reading the file into one long string and then use whitespace to split the string into a list of separate words. In Python it looks like this:

input = open('cca0710-trimmed.txt', 'r')
text = input.read()
wordlist = text.split()

or like this if you want to show off:

wordlist = open('cca0710-trimmed.txt', 'r').read().split()

Now that we have our word list, the next step is to create the dictionary. We do this first by counting the number of occurrences of each word in the list:

wordfreq = [wordlist.count(p) for p in wordlist]

Then we pair each word with its corresponding frequency to create the dictionary:

dictionary = dict(zip(wordlist,wordfreq))
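
As an aside, calling count once for every word rescans the whole list each time, which gets slow on a long text. A single pass over the list builds the same dictionary:

dictionary = {}
for w in wordlist:
    dictionary[w] = dictionary.get(w, 0) + 1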

Now that we have the dictionary, we can sort it in descending order of frequency and print out the results:

aux = [(dictionary[key], key) for key in dictionary]   # (frequency, word) pairs
aux.sort()
aux.reverse()                                          # most frequent words first
for a in aux: print a

This gives us results like the following:

(2574, 'the')
(1394, 'of')
(880, 'to')
(855, 'and')
(572, 'in')
(548, 'was')
(545, 'a')
(420, 'his')
...
(213, 'for')
(212, 'Frontenac')
(209, 'by')
(194, 'not')
...
(76, 'would')
(75, 'Iroquois')
(74, 'upon')
...
(68, 'English')
(68, 'Canada')
(66, 'New')
(65, 'France')
...


Not too hard, eh?


Thursday, August 10, 2006

Google as Corpus

Last year, the Economist noted that corpus linguists are increasingly turning to the world wide web as a model of what people actually say ("Corpus colossal," 20 Jan 2005). There are complications, to be sure, many of them caused by the never-ending war on spam. To a good first approximation, however, it is possible to mine the web for examples of common and not-so-common phrasing. Many of these techniques have applications for digital history.

Consider the problem of trying to find good keywords, whether for searching through sources or for indexing a compilation of documents. How can we program a computer to figure out what a document is about? We might start with a measure of how frequently particular words occur. (For this we can use one of the TAPoR tools.) Suppose we find that document D contains multiple copies of words like 'state' (24 times), 'water' (23), 'Chinese' (17), 'San Francisco' (10), 'steamer' (6) and 'telegraph' (3). The document could be about anything, but a historian might reasonably guess that it was about California in the 19th century.

Frequency is not enough by itself. Document D also contains multiple copies of 'the' (313), 'of' (145), 'and' (144), 'in' (89) and many other words that do little to distinguish it from other documents written in English. As a result, specialists in information retrieval have created a measure called tf-idf, which weights the frequency of a given term by its relative infrequency in the corpus as a whole. Here's where Google comes in. When you search for a term, the search engine responds with the number of pages that contain that term. A couple of years ago, Philipp Lenssen used the Google API to determine the frequency of the 27,693 most used words. At the time, the word 'the' appeared in 522 million pages. The web has grown since then: three years later 'the' appears in about 42 times as many pages.
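
To make the weighting concrete, here is a back-of-the-envelope version in Python. It uses the standard tf-idf formula, with Google page counts standing in for document frequencies; the page counts are the ones quoted in this post, and the total number of pages N is only a rough guess, so treat the numbers as illustrative.

from math import log

N = 20e9    # rough guess at the total number of pages Google indexes

def tfidf(term_freq, page_hits):
    # weight a term's frequency in our document by its rarity on the web
    return term_freq * log(N / page_hits)

# term frequencies from document D, page counts from Google
print 'Chinese', tfidf(17, 859000000)
print 'steamer', tfidf(6, 9680000)
# a very common word like 'the', which appears on a large fraction of all
# pages, gets a weight near zero no matter how often it occurs in D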

If we look for words in document D that have a relatively high tf-idf (using Google as our corpus), we can come up with a set of potential keywords, shown in the table below. Note that even the least frequent of these occurs in more than half a million documents on the web.

keyword          Google hits
chinese          859,000,000
mud               66,900,000
steamer            9,680,000
Tuolumne           2,070,000
Alviso               931,000
Ashburner            635,000


When we use the keywords together, however, we can pick our document D out of the billions that Google indexes. The table below shows the number of hits for each successively narrower search.

search terms                                      Google hits
chinese                                           859,000,000
chinese mud                                         5,830,000
chinese mud steamer                                   141,000
chinese mud steamer Tuolumne                              621
chinese mud steamer Tuolumne Alviso                        70
chinese mud steamer Tuolumne Alviso Ashburner               1


In other words, used in conjunction with the Google Search API, a metric like tf-idf can automatically find a set of keywords that will return a given document. We can then find similar documents by relaxing the search a little bit and seeing what else turns up.

Besides frequency, we can also consider the context that a given keyword appears in. One way to do this is by simply searching for seemingly characteristic phrases. "Blunts the keener feelings," for example, is apparently unique to document D, as is "muddy, cheerless and dull beyond telling." There are some specialized tools, like the Linguist's Search Engine, which allow the user to search for grammatical structures. You can tell it that you want to find constructions like "the bigger the house the richer the buyer" and it will return with things like "the darker the coffee bean, then the less caffeine." (See the 2005 paper by Resnik, Elkiss, Lau and Taylor for more info.)

Until now Google has kept most of the raw data to themselves, generously providing an API for researchers to access some of it. On August 3rd, however, they announced that they would be making more than a billion 5-grams available to researchers. An n-gram is a sequence of tokens (words, in this case) that you get by sliding a window across a text. If S is the previous sentence split into a list of words, then you can generate all of the 5-grams with a one-line Python script

print [S[i:i+5] for i in range(len(S)-4)]

which returns

[['an', 'n-gram', 'is', 'a', 'sequence'],
['n-gram', 'is', 'a', 'sequence', 'of'],
['is', 'a', 'sequence', 'of', 'tokens'],
['a', 'sequence', 'of', 'tokens', 'words'],
['sequence', 'of', 'tokens', 'words', 'in'],
['of', 'tokens', 'words', 'in', 'this'],
['tokens', 'words', 'in', 'this', 'case'],
['words', 'in', 'this', 'case', 'that'],
['in', 'this', 'case', 'that', 'you'],
['this', 'case', 'that', 'you', 'get'],
['case', 'that', 'you', 'get', 'by'],
['that', 'you', 'get', 'by', 'sliding'],
['you', 'get', 'by', 'sliding', 'a'],
['get', 'by', 'sliding', 'a', 'window'],
['by', 'sliding', 'a', 'window', 'across'],
['sliding', 'a', 'window', 'across', 'text']]


The Google research team did something similar on more than a trillion words from the web, and kept every 5-gram that occurred more than forty times. (Which means that "cheerless and dull beyond telling" won't be in the data set, since it occurs only once, in document D.)

So what can you do with n-grams? Lots. Daniel Tauritz, a computer scientist at the University of Missouri-Rolla, keeps a clearinghouse of n-gram research. A partial list of applications includes "text compression (1953), spelling error detection and correction (1962), optical character recognition (1967), information retrieval (1973), textual representation (1979), language identification (1991), missing phoneme guessing (1992), information filtering (1993), automatic text categorization (1994), music representation (1998), spoken document retrieval (2000), computational immunology (2000) and medical record matching (2001)." Many of these techniques have clear applications or analogs in historical research.


Saturday, August 05, 2006

A Metric for the Popular Imagination

In a paper that I'm currently revising for publication, I introduced the idea of fetishism as a category of analysis in the social sciences by referring to the popular idea of a fetish. One of the volume editors asked me for evidence to back up my claim. How do we know which ideas are popularly held? Surveys, maybe, but it doesn't make sense to do one in this case and my sample size would be far too small anyway.

After some reflection, I realized that I could provide evidence about popular notions of fetishism by surfing for porn, er, doing some statistical linguistics. In "Automatic Meaning Discovery Using Google," Rudi Cilibrasi and Paul Vitanyi provide a metric they call the Normalized Google Distance (NGD). The basic idea is straightforward although the underlying math is deep. If you have two search terms like 'cat' and 'mouse', then each of those terms will appear on some number of pages, and a smaller number of pages will contain both 'cat' and 'mouse'. Intuitively, we expect to find 'cat' with 'mouse' more frequently than 'cat' with 'louse'. The NGD formalizes this idea by providing a measure of how far apart particular terms are in conceptual space.

Time for a hack. I wrote a simple Perl script that uses the Google Search API to calculate the NGD for a pair of terms.
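
The formula itself is short: for terms x and y, let f(x) and f(y) be the number of pages containing each term, f(x,y) the number of pages containing both, and N the total number of pages indexed. A rough Python sketch of the calculation (with the page counts passed in as parameters and N left as an estimate) looks like this:

from math import log

def ngd(fx, fy, fxy, N):
    # fx, fy: page counts for each term; fxy: pages containing both terms;
    # N: total number of pages indexed (an estimate)
    return (max(log(fx), log(fy)) - log(fxy)) / (log(N) - min(log(fx), log(fy)))

# e.g. ngd(hits('fetish'), hits('latex'), hits('fetish latex'), 20e9),
# where hits() stands for whatever routine you use to query the search API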

Using this tool, we can provide a measure for the distance between the term 'fetish' and some popular and scholarly associations. (Lower numbers mean the terms are more closely associated.)

term             NGD from 'fetish'
latex            0.356331874622984
heels            0.421554691152762
gag              0.497568478291903
choke            0.549934320808182
rubber           0.553427573822638
leather          0.57443530297729
doll             0.581254531681847
dungeon          0.604959474281258
handcuffs        0.621969750564945
smoking          0.629600508128091
balloon          0.648364151347237
cigar            0.715386689730539
fur              0.787872974829155
freud            0.792196156666702
psychoanalysis   0.797465589884342
marx             0.8195787086072
krafft-ebing     0.955885248436093
commodity        1.00639028092102


At this point Google really does constitute what John Battelle called "the database of intentions." More about Google in my next post...

Update (26 Aug 2006): Nicolás Quiroga translated this post into Spanish for his new digital history blog Tapera.


Wednesday, August 02, 2006

Microsoft Live Labs Photosynth and Place-Based Computing

One of my research programs is in what I call place-based computing. By combining digital sources with geographic information system (GIS) software, global positioning system (GPS) receivers and handheld or tablet computers, it becomes possible to access historical sources in the place that the source represents. You can stand outside of an old building, for example, and see photographs of how it used to look, census or ownership records, old newspaper articles about it, and so on. We introduced the technology in the Western Public History MA program last year, when our grad students used it to create a technology-enhanced walking tour of a heritage conservation district.

Up until now, most of the sources that we've been using for place-based computing have been geocoded by hand. If we know that a newspaper article refers to a particular building or intersection, we can use a web service like Geocoder.ca to determine the latitude and longitude, and then make an entry for the source in the GIS database. Part of the impetus for text and data mining is to automate this process. (One might, for example, use web services to geocode every place name in the Dictionary of Canadian Biography, then download the entire thing to a handheld. Then when you visited Chicoutimi, you could discover that Dominique Racine was parish priest there from 1862-88, that William Price and Peter McLeod had timber operations there in the 1840s, and so on. In effect, the system would be a travel guide that emphasized Canadian historical biography.)

Techniques for mining text are relatively well-understood compared with those for mining images. At this point, a number of Canadian archives, such as the McCord Museum and the BC Archives, have large repositories of digital historical images. Many of these are images of places, although the metadata does not include tags for latitude and longitude. Ideally, it would be possible to sift through these images automatically, returning ones that matched, say, a present-day photograph of an existing structure.

As a preliminary step in this direction, I got a small grant to experiment with using some off-the-shelf photogrammetry software (PhotoModeler). I used photographs of a present-day building to create a partial 3D model, then imported historical photographs of an adjacent building that had been demolished. The software allowed historical and contemporary photographs to be displayed in the same 3D space, although the work is painstaking, to say the least. (You can read the report here.)

Enter Photosynth, from Microsoft Live Labs. Researchers recently announced a pretty amazing new tool that automatically extracts features from images, uses photogrammetry to stitch them together in three dimensions and then gives you an immersive browser to explore the whole thing. Their demo shows the system with contemporary photographs, but historical photographs should work as well. I can't wait for a chance to play with their software.
