Thursday, September 28, 2006

No 'Secret Syllabus' for Digital History

My colleague Rob MacDougall recently suggested that we teach to two syllabi, the one we give to students and a "secret" one:

Every course we teach has two syllabi, I think. There’s the visible one, the actual list of readings and topics we assign to our students. And then there’s the secret syllabus, made up of whatever assortment of books and articles we also happen to be reading while teaching the course. These are the various bees and bats in our belfries and bonnets, the things we’re chewing on as we walk into the classroom, the new interpretations and the rediscovered classics that get us fired up about a topic we may have taught several times before.


It's a fun observation and it rings true for me. It explains the asides that I bring to my survey lectures and the discussions that I have with students about the discrepancies between what I want to talk about in class and what it says in the text. When I talk about the fur trade, I have to make sure that I talk about the staples thesis, but I am really fired up about the kinds of questions that Carolyn Podruchny has been asking. Why did voyageurs practice a mock baptismal rite but not, say, a mock communion? What does this tell us about how they understood space and place? What was the connection between the aboriginal windigo and the European werewolf? (For these and much more, see her wonderful forthcoming book Making the Voyageur World). When I talk about the Jesuits, I have to make sure that they know about Paul Le Jeune and the Jesuit Relations. But in the back of my mind I'm thinking of Peter Goddard's discussion of the degree to which the Jesuits of New France actually believed in the agency of demons.

But not every course has a secret syllabus. Notably, I don't have one for my digital history grad class. The things that I put in the syllabus this year are exactly the things I am struggling with right now. In 20 years, perhaps, when digital history is an established field with a hundred visions and revisions ... maybe then I will feel the impatient tension between what I need to tell them and what I want to tell them. But for now all we have are the visions. When one of my tech-savvy students tells me in a bemused way that he has never felt so lost in all his life, I can agree. Me either. Right now digital history is an exploration. We don't know what we're going to find. We don't even have a map, never mind a secret one. That is why it is such a great time to become a digital historian.


Sunday, September 24, 2006

Student Reflections on Digital History

The autumn term is now under way and our students are back in school. Those taking grad courses in digital history at The University of Western Ontario and George Mason University are doing reflective blogging as part of their coursework. While not exactly the same, the two courses cover a lot of the same ground, and the students are wrestling with many of the same issues. Their blogs make for very interesting reading, and Josh Greenberg (who is teaching 696) and I hope that there will be opportunities for members of the two classes to interact with one another. So by way of introduction: History 513F, meet History 696.

Hist513F
Bryan Andrachuk
Lauren Burger
Diana Dicklich
John Jordan
Kelly Lewis
Molly Macdonald
Adam Marcotte
Carling Marshall
Kevin Marshall
Jeremy Sandor

Hist696
Bill Andrews
Amanda Bennett
Jeff Bowers
James Garber
Misha Griffith
Karin Hill
Thomas Jenkins
John Lillard
Jenny Reeder
Steven Scott
Jennifer Skomer
Dieter Stenger
Tad Suiter
Karen Tessier
Billy Wade
Alan Walker
Gwen White

Update 29 Sep 2006. My friend Mills Kelly notified me that there are actually two grad digital history courses at GMU this semester. He is teaching History 689: Teaching and Learning in the Digital Age, and his students are blogging too. Welcome!


Tuesday, September 19, 2006

Extending Zotero for Collaborative Work

In an earlier post I reviewed the forthcoming open-source Firefox extension Zotero. In brief, Zotero is able to automatically extract citation information from a web page that you are browsing and store it in a database. It also lets you organize and search through your research notes and do a number of other useful things. Since it is open source, users are free to develop the software and add new features. In my review I suggested a few features that could be added, such as support for RSS feed aggregation and spidering / text mining within the browser.

Here I'd like to speculate a little bit more about the kinds of things that Zotero could be used for, this time concentrating on scholarly collaboration. In the version that I reviewed, Zotero stores citation information in a local SQLite database. It also allows you to import and export information in a variety of XML-based standard forms. Putting these two features together, it should be straightforward to send XML records over the web so that they could be stored in a nonlocal SQL database. Imagine, then, that in the Zotero collections panel you could have nonlocal folders that were automatically synchronized to shared external databases. You could subscribe to a bibliographic community with a particular interest and receive citations to sources as well as contribute your own. Such communities might form around a particular project (e.g., the references to be cited in an edited volume or jointly-authored textbook) or around a particular event (e.g., Hurricane Katrina) or an emerging research field (e.g., digital history). Since Zotero also allows you to work with research notes, PDFs, images, and other kinds of files it would be possible to synchronize most of the files associated with a particular project amongst collaborators. It would also be easy to flag information that a particular user had already seen, so that new items could be visually coded to draw attention. (In the Sage RSS feed aggregator, for example, feeds that haven't been updated are in normal font and those with new information are in boldface.)
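To make the idea a little more concrete, here is a minimal sketch (in Python, to match the other examples on this blog) of how a client-side script might push one exported RDF/XML record to such a shared collection over HTTP. The server URL and form field are hypothetical, and a real implementation would of course live inside Zotero itself.

import urllib, urllib2

# Hypothetical shared server that accepts exported bibliographic records.
SHARED_URL = 'http://example.org/shared-bibliography/submit'

def push_record(rdfxml):
  # POST one RDF/XML record to the shared collection and return the reply.
  data = urllib.urlencode({'record': rdfxml})
  response = urllib2.urlopen(SHARED_URL, data)
  return response.read()

# rdf = open('exported_item.rdf').read()
# print push_record(rdf)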


Saturday, September 09, 2006

What We Need Now Is a Good Trolling Engine...

One thing that is difficult to do with a traditional search engine is find documents that were written at a particular time (the new Google News Archive Search being a notable exception). Suppose, for example, that you are starting a research project on the environmental history of nineteenth-century gold rushes in North America. Are there good collections of online primary sources that you should know about? Of course, but they can be hard to find. It would be great to be able to limit your Google searches to documents written during particular date ranges, e.g., 1848-55 (for the California gold rush), 1858-65 (Cariboo) or 1896-99 (Klondike).

This turns out to be more difficult than you might think at first. Google Advanced Book Search allows you to specify a publication date range. So a search for "gold california date:1848-1855" returns books like Walter Colton's Three Years in California (1850), which you can actually download as a PDF. But other books are not going to show up, like A Doctor's Gold Rush Journey to California by Israel S. Lord, which was written from 1849 to 1851 but not published until 1995. In cases like these, you are searching through metadata rather than through the document itself. Most of the material on the web doesn't have enough metadata to be really satisfactory for this kind of searching.

Furthermore, depending on the project you may not always have good search terms. Suppose you are thinking of becoming a digital medievalist and want to get some idea of what kinds of sources you might be able to work with. How do you search for machine-readable documents written in Old English? Obviously you will try to make use of the traditional scholarly apparatus and of online resource guides like The ORB.

To supplement this kind of activity, I'm thinking it would be very nice to have what I'm going to call a "trolling engine," a tool that can sift through the Internet on a more-or-less continuous basis and return items that match a particular set of criteria determined by a human analyst. You would set it up, say, to look for documents written during the Cariboo gold rush, or written in Old English around the time of King Alfred, or ones that may have been written by ornithologists in the West Midlands in the 1950s (if you're interested in the latter, you're in luck).

So how would a trolling engine work? In present-day search engines, spiders scour the web downloading pages. A massive inverted index is created so that there is a link running from each term back to every page on which it appears. Once this blog post is indexed by Google's spiders, for example, there will be links to it in their inverted index from "trolling," "engine," "spiders" and many other terms. The catch is that there is not a lot of other publicly accessible information associated with each term. Suppose, however, that Google also tagged each term with its part of speech and parsed all the text in the surrounding context. Then you would be able to search for items in a particular syntactic frame. As Dan Brian showed in an interesting article, you could search for all instances of "rock" used as an intransitive verb, and find sentences like "John thought San Francisco rocked" without finding ones like "The earthquake rocked San Francisco." There is already a pretty cool program called The Linguist's Search Engine that lets you do this kind of searching over a corpus of about 3.5 million sentences.
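For readers who haven't run into inverted indexes before, the data structure itself is simple. The toy sketch below (nothing like Google's actual implementation, of course) maps each term to the set of pages it appears on.

def build_index(pages):
  # Map each term to the set of page identifiers it appears on.
  index = {}
  for url, text in pages.items():
    for term in text.lower().split():
      index.setdefault(term, set()).add(url)
  return index

pages = {
  'post1': 'John thought San Francisco rocked',
  'post2': 'The earthquake rocked San Francisco',
}
index = build_index(pages)
print sorted(index['rocked'])   # ['post1', 'post2']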

In fact, being able to search the whole web for words in particular syntactic frames could be a very powerful historical tool for a simple reason: languages change over time. Take "sort of/kind of." For at least six hundred years, English speakers have been using these word sequences in phrases like "some kind of animal," that is, as a noun followed by a preposition. By the nineteenth century, "sort of" and "kind of" also appeared as degree modifiers: "I kind of think this is neat." In a 1994 Stanford dissertation, Whit Tabor showed that between the 16th and 19th centuries, "sort of" and "kind of" increasingly appeared in syntactic frames where either reading makes sense. That is, "kind of good idea" might be interpreted as [kind [of [good idea]]] or [[[kind of] good] idea]. So if you find a document that uses "sort of" or "kind of" as a degree modifier, you have one clue that it was probably written sometime after 1800. (See the discussion in Manning and Schütze for more on this example.)

It's not just these two word sequences that have a history. Every word, every collocation has a history. A word like "troll" is attested as a verb in the fourteenth century and as a noun in the seventeenth. Its use as a fishing term also dates from the seventeenth century. If your document is about trolls it was probably written after 1600; if it is about trolling, it could have been written earlier (see my post on "A Search Engine for 17th-Century Documents"). By itself, the earliest attested date of a single word or collocation is weak evidence. If we were to systematically extract this kind of information from a very large corpus of dated documents, however, we could create a composite portrait of documents written in AD 890 or during the Cariboo gold rush or at any other given time.
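Here is a very rough sketch of how the composite logic might work. The little lexicon of earliest attested dates is made up for illustration; a real version would be extracted from a large corpus of dated documents (and would tokenize properly instead of matching substrings).

# Illustrative earliest attested dates, keyed by word or collocation.
attested = {
  'troll': 1600,       # as a noun
  'trolling': 1300,    # earlier, as a verb
  'kind of': 1800,     # as a degree modifier
}

def earliest_possible_date(text):
  # The latest of the earliest attestations gives a rough lower bound
  # on when the document could have been written.
  text = text.lower()
  years = [year for word, year in attested.items() if word in text]
  if not years:
    return None
  return max(years)

print earliest_possible_date('I kind of think trolling is neat')   # 1800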

A similar logic would help us find documents written by ornithologists. In this case, the training corpus would have to be tagged with a different kind of metadata in addition to the date: the occupation of the author. Once we had that we could discover that two words that appear separately on millions of web pages, "pair" and "nested", occur quite rarely as the collocation "pair nested." That's the kind of thing an ornithologist would write.
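A crude way to quantify that intuition is to compare how often the two words appear together as a collocation with how often each appears on its own; the sketch below just counts occurrences in a couple of made-up sentences.

def collocation_counts(texts, w1, w2):
  # Count texts containing the collocation, and each word separately.
  both = sum(1 for t in texts if (w1 + ' ' + w2) in t.lower())
  first = sum(1 for t in texts if w1 in t.lower())
  second = sum(1 for t in texts if w2 in t.lower())
  return both, first, second

texts = ['A pair nested in the hawthorn hedge in May',
         'I nested the loops to scan each pair of files']
print collocation_counts(texts, 'pair', 'nested')   # (1, 2, 2)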


Thursday, September 07, 2006

A First Look at Zotero

Our school year officially started today but I'm not teaching on Thursdays this term, so I was able to spend the day hacking the pre-release beta of Zotero and listening to an album of pioneering electronic music. The music turned out to be the perfect complement to the software.

The basics. Zotero is the brainchild of a team of digital historians at the Center for History and New Media at George Mason University: Dan Cohen, Josh Greenberg, Simon Kornblith, David Norton and Dan Stillman. Their basic goal was to create a freely available, open source tool that would put the essential functions of standalone bibliography software like Endnote into the Firefox browser. Since we already spend most of the day reading and writing in our browsers (e-mail, blogging, newsfeeds, online journals, e-books, library catalogs, etc.) this makes a lot of sense. Like commercially available packages, Zotero allows you to create and cite from a database of primary and secondary references of various types (books, newspaper articles, journal articles, and so on). Instead of starting a separate program, however, you can enter records as you browse library catalogs (e.g., Library of Congress, WorldCat), bookstores (Amazon.com) and many other websites.

Zotero already has some distinct advantages over commercial bibliographic management software. For one thing, you can arrange your sources hierarchically. The interface is divided into three panels which give you different views into your collections of sources, using the familiar file folder metaphor. The lefthand panel shows a top level view of folders, the centre panel shows the contents of the currently selected folder, and the righthand panel shows a tabbed display of the details for the currently selected item. You can see a screenshot of the interface in the quick start guide. It is easy to customize the information presented in the middle panel. Zotero not only allows you to create bibliographic records, but also makes it easy to link to webpages, to snapshots of webpages, to other files like JPEGs and PDFs, and to notes which you can create directly in your browser. You can tag records with Library of Congress subject headings (LCSH) or with your own tags, or a mixture of the two. You can also link records within your collections to one another. (I have to admit that I haven't quite figured out a use for this.) The interface also easily toggles between views that take up all, some or none of the browser page. Finally, there is a feature called "smart collections" which lets you save the results of a search as a virtual folder. This is handy because it gives you different views of the same data without requiring you to enter it in multiple places.

Sensing citation information. Let's take it as read that Zotero is a great tool for keeping track of your bibliographical information without leaving the browser. But there's more. When you browse a page that has citation information embedded in it, Zotero "senses" that and lets you know. You then have the option of automatically scraping some or all of the data to your bibliographic database. The beta version already supports this behaviour at a number of widely used sites like the Library of Congress, WorldCat, Amazon.com and the New York Times. In my trial runs, it worked perfectly at the Library of Congress and Amazon, and with a few hiccups at a number of other sites. Since Zotero is extensible, expect to see user-contributed scrapers start to appear as soon as the beta is released. (More on this below). In my own university's library catalog, I had to switch to MARC view, and then Zotero worked perfectly. But then scrapers are notoriously brittle.

Hacking Zotero. Zotero exports to RDF/XML and imports from a number of XML-based standards (RDF, MARC, MODS and RIS). Since it is pretty easy to write programs to manipulate RDF/XML in high-level programming languages, it will be possible for digital historians to collect resources via browsing in Zotero, then automate the processing of those records. It will also be possible to write programs that collect raw data (e.g., spiders), do some processing and then write the output in a format that can be imported into Zotero and scanned by a human interpreter. In other words, your Zotero collection (or anyone else's, or a whole bunch of people's) can be part of a workflow that includes both people and machines. This will be very useful for text and data mining projects.
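As a small example of the kind of thing I have in mind, the sketch below pulls the titles out of a Zotero RDF/XML export so they can be handed to some other program. I'm assuming the export uses Dublin Core dc:title elements; check that against an actual export before relying on it.

from xml.dom import minidom

def titles_from_export(filename):
  # Parse an exported RDF/XML file and return the text of every dc:title.
  doc = minidom.parse(filename)
  return [node.firstChild.data
          for node in doc.getElementsByTagName('dc:title')
          if node.firstChild is not None]

# print titles_from_export('my_zotero_export.rdf')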

Behind the scenes. (This part of the review can be safely skipped if you aren't going to modify the program yourself). Since Zotero is open source, it is possible to look at the code and see how it works. Then hack it. Zotero installs two folders in your Firefox profile, one called "zotero" that holds your data, and one called "extensions/zotero@chnm.gmu.edu" that holds the source code. The former contains a SQLite database that Firefox (and thus Zotero) uses to hold client-side information. You can download a SQLite utility that allows you to interact with the tables WHEN YOU ARE NOT RUNNING FIREFOX. (Otherwise you run the risk of hosing your database.) With this utility you can enter commands like ".schema" to see the SQL statements needed to create the database, or "select * from tags" which shows you the tags you have already created. Modifications to the Zotero code can be done in a few places, notably the files "schema.sql" and "scrapers.sql". If you wanted to add record types to your database, for example, you'd have to modify the former. The scrapers are written in JavaScript and stored in the database. Presumably, the stable release of Zotero will include some tutorials showing how to write simple scrapers, but an intrepid programmer can probably figure it out from the supplied code. (And more. You can almost feel Kornblith's pain in one of his plaintive comments: "// Why can''t amazon use the same stylesheets".)
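The same sort of poking around can also be done from a short Python script instead of the SQLite command-line utility. This sketch uses the sqlite3 module (standard in Python 2.5; older installations need pysqlite), the path to the database is a placeholder, and the same warning applies: only open the database when Firefox is not running.

import sqlite3

def list_tags(dbpath):
  # Run "select * from tags" against the Zotero database and return the rows.
  connection = sqlite3.connect(dbpath)
  cursor = connection.cursor()
  cursor.execute('select * from tags')
  rows = cursor.fetchall()
  connection.close()
  return rows

# print list_tags('/path/to/firefox/profile/zotero/zotero.sqlite')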

Notes for the Zotero team. Don't let my suggestions for future improvements distract you from the bottom line: Zotero is already an amazing piece of software that could change the way we do history. The visionary team at CHNM should really be congratulated for making this, and making it open source. But since it is a beta release...
  • There is a time stamp when records are added to the database or modified, which is great. The same information isn't readily available, however, when one takes a snapshot of a webpage.
  • Library of Congress scraper: it would be useful to automatically harvest links such as URLs. Future releases (or hacks) could build on this by including the ability to spider within Zotero.
  • WorldCat scraper: should grab OCLC number and put it into call number field. Again, this is crucial for automated spidering.
  • Geocoding: please, please add fields for latitude and longitude. Geocoded sources are worth having, and I want to mash them up with the Google Maps API and MapServer.
  • Record types: at the very least, we need a generic record type to hold odds and ends. Ideally there would also be new types for blogs, blog posts, archival fonds, generic material objects and audio CDs.
  • Tags: when adding tags, there should be a lookup table so that you can select one you've already used (good for consistency).
  • Tags: nice to have a way of browsing by tag (as in del.icio.us), probably over in the lefthand panel.
  • RSS feeds: it would be awesome if the functionality of Sage was built into the lefthand pane. Many of the sources I want to cite these days are blog posts.


Monday, September 04, 2006

Nature's Metropolis and Its Hinterland

Fifteen years ago William Cronon published Nature's Metropolis: Chicago and the Great West, a book that has become something of a classic in environmental history and many other fields. In the book, Cronon shows how the history of nineteenth-century Chicago wasn't merely the history of a city, but rather the history of the relations between a city and the hinterland that it dominated. This was a fairly novel perspective since the history of the American West has often been viewed in terms of a frontier. (Nature's Metropolis seems less revolutionary from the perspective of Canadian historiography, which already had a metropolitan-hinterland thesis, but it's a great book nonetheless.)

Thinking idly about the historiographical significance of Cronon's book, I got to wondering what kind of hinterland it dominates. That is to say, where do we find publicly accessible copies of Nature's Metropolis? Fortunately, the new Open WorldCat makes it quite easy for a digital history hacker to answer a question like this. First we search for the locations of libraries where the book is held and scrape the addresses. Then we use a lookup table to map the zip codes to latitude/longitude pairs. Finally, we hand the whole thing off to Google Maps to plot the results, shown below.

[Maps of library locations holding Nature's Metropolis, plotted with Google Maps.]
From the maps, it looks as if the hinterland that Nature's Metropolis dominates is the northeastern US.
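The middle step, turning scraped zip codes into coordinates for Google Maps, might look something like the sketch below. The tiny lookup table is illustrative; a real one would be loaded from a zip code database.

# Illustrative zip code lookup table: zip code -> (latitude, longitude).
ziptable = {
  '60637': (41.78, -87.60),   # Chicago
  '02138': (42.38, -71.13),   # Cambridge, MA
}

def geocode(zipcodes):
  # Return coordinate pairs for the zip codes we know about.
  return [ziptable[z] for z in zipcodes if z in ziptable]

for lat, lng in geocode(['60637', '02138', '99999']):
  print lat, lng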


Sunday, September 03, 2006

Easy Pieces in Python: Simple Scraping

Information on the web is increasingly made available in forms that are easy for machines to process, often variants of XML. Many sites, however, still present information in a way that looks good to human readers but is more difficult to process automatically. In cases like these, it is nice to be able to write a scraper to extract the information of interest.

As an example, suppose we are working with OCLC's fabulous new Open WorldCat, which allows web users to search the catalogs of over 10,000 different libraries with a single query. Among other things, the system allows the user to find all of the different works created by an author and to locate copies of any given work. Open WorldCat is designed for human searchers, however, so you have to do a bit of programming if you want to automatically extract information from its pages.

Most scraping starts with the programmer looking at the webpage of interest and comparing it with the code that generated it. You can do this in Firefox by going to the page and choosing View -- Page Source (in Internet Explorer the command is View -- Source). The basic idea is to locate instances of the information that you are interested in and study the context of these in the code. Is there something that always precedes or follows the thing you are interested in? If so, you can write a regular expression to match that context.

Suppose we are looking for works by a particular author, say Alan MacEachern. The WorldCat URLs for his author page, for a particular book (The Institute of Man and Resources), and for all copies of that book in Ontario are shown below. Since we are programming in Python, we assign them to variables.

authorurl = r'http://www.worldcat.org/search?q=alan+maceachern'
workurl = r'http://www.worldcat.org/oclc/51839396'
locationurl = r'http://www.worldcat.org/oclc/51839396&tab=holdings?loc=ontario#tabs'

Having studied the source code of those pages, we've also determined the patterns that we will need to extract works from the author page and library addresses from the location pages. (I admit that I'm reaching a bit when I call this part "easy"... I guess it is easy if you already know how to use regular expressions. Hang in there.)

workpattern = r'<a.*?href="/oclc/(.*?)\&.*?".*?>.*?</a>'
addresspattern = r'<td class="location">(.*?)</td>'

Now we need some code that will open a webpage, pass our pattern across it line by line, and return any matches. Since we will want to reuse this code whenever we need to scrape a page, we wrap it in a function.

import re, urllib

def scraper(url, filter=r'.*'):
  # Fetch the page at the given URL.
  page = urllib.urlopen(url)
  # Compile the caller's pattern, ignoring case.
  pattern = re.compile(filter, re.IGNORECASE)
  returnlist = []
  # Pass the pattern across the page line by line, collecting all matches.
  for line in page.readlines():
    returnlist += pattern.findall(line)
  return returnlist

So, how does it work? We can now ask it to return the OCLC numbers for all works created by Alan MacEachern:

r = scraper(authorurl, workpattern)
print r

And this is what we get:

['44713451', '51839396', '61175031', '40538595', '46521636']


And we can ask it to return the addresses of all of the libraries in Ontario that have a copy of The Institute of Man and Resources:

r = scraper(locationurl, addresspattern)
print r

And this is what we get:

['Waterloo, ON N2L 3C5 Canada', 'Ottawa, ON K1A 0N4 Canada', 'Hamilton, ON L8S 4L6 Canada', 'Kingston, ON K7L 5C4 Canada', 'Toronto, ON M5B 2K3 Canada', 'Guelph, ON N1G 2W1 Canada', 'London, ON N6A 3K7 Canada']


By modifying the URLs and patterns that we feed into our scraper, we can accomplish a wide variety of scraping tasks with a small amount of code.
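For example, combining the patterns above, we could look up the Ontario holdings for every one of MacEachern's works in a single loop. (The holdings URL here is built on the pattern of locationurl; it may need adjusting if WorldCat changes its URLs.)

for number in scraper(authorurl, workpattern):
  url = r'http://www.worldcat.org/oclc/' + number + r'&tab=holdings?loc=ontario#tabs'
  print number, scraper(url, addresspattern)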


Friday, September 01, 2006

Digital History Blogging Update

In June I posted a roundup of digital history blogs with the idea of highlighting some of the interesting work that is currently being done and maybe getting people to contact me if they were feeling left out. Since then, we've been joined by Nicolás Quiroga, who blogs about digital history in Spanish at Tapera. Nicolás recently spidered the blogrolls of the digital history sites in my original roundup and plotted some interesting visualizations. If my reading knowledge of Spanish and Babelfish haven't completely let me down, he also asked if someone is willing to hack a more sophisticated (Babelfish: "less rustic") spider and make the source code publicly accessible. Consider the gauntlet to be thrown down.

I was also contacted by Joseph Reagle who blogs about Open Communities, Media, Source and Standards and is working on a dissertation on the collaborative culture of Wikipedia at NYU.

Last but not least, I've also been following Semantic Humanities, a blog about web technology and humanities scholarship that digs into scripting and markup and occasionally breaks into code.
