Wednesday, December 27, 2006

Bent Circuits and Buzz Machines

It's midwinter here, a good time for indoor activities. Since I like electronica but don't have any musical training or talent, I decided to spend some time circuit bending and playing with buzz machines.

It's fairly easy to get started bending circuits. Go to a junk store and buy some battery-powered toys that make sounds or music. Take one apart to expose the circuit and put some new batteries in. Now, while the thing is making noise, try short-circuiting it at random with a piece of wire. When you get interesting results, mark the relevant connections with a felt pen. Later you can go back and try inserting switches, variable resistors or other components at these points, or even new circuits. When you've got something that makes a whole range of unexpected sounds, you build it back into the original case or put it in a new housing. The prevailing aesthetic seems to prefer modding the original case, as in Reed Ghazala's Incantor. Voila, a new musical instrument! Some toys seem to lend themselves to the process more readily than others; I didn't have much luck until my third attempt. As you might imagine, the whole thing is very hackish and aleatoric. (For more information, see Ghazala's Circuit Bending and/or Collins's Handmade Electronic Music.)

Building musical instruments from circuits is pretty low-level. Buzz machines simulate these kinds of instruments and many others, allowing the user to virtually plug together hundreds of different kinds of noise generators, synthesizers, filters, delays and so on, to create layered and complex compositions. This is done at a level that is more familiar to most musicians, using tracks, beats and notes.

Both bent circuits and buzz machines lend themselves to a kind of feedback-driven exploration. If you have something that sounds cool, you can make a note then try adjusting the circuit: change connections, add or remove components, tweak something. If you like the results, keep going. If not, back up and try something else.

Sitting at the workbench, fiddling with wires and inhaling molten solder and burnt plastic, I've had plenty of time to muse about digital history. At this point our discipline is where music was in the first half of the twentieth century. There is a long, classical tradition that most historians know and work in. There are a few new technologies that are familiar and widespread... we might think of, say, word processors the way that they thought of vinyl records. There are a few technologies that are becoming familiar but aren't taken very seriously (wiki = theremin?). While the more avant-garde are playing with the equivalent of tape loops or prepared pianos, there hasn't yet been widespread adoption and endless modification of these and other techniques. They're still the domain of Brian Eno and not of Madonna. This makes it difficult to teach digital history, because students have to be simultaneously exposed to low-level stuff that is still far from their day-to-day research (e.g., programming in Python) and high-level stuff that is potentially relevant but tangential (e.g., data mining). What's in-between doesn't exist yet.


Tuesday, December 26, 2006

Digital History Year in Review

In August, AOL released some of their search data for more than half a million users. Among other things, it gives us some idea of what people are looking for when they search for history. Dan Cohen continued to underline the importance of APIs for the humanities.

Google digitized millions of Books, leading Gregory Crane and others to wonder what you can do with a million books. The Open Content Alliance had the same plan, different vision.

Cyberspace, home of Collective Intelligence and Convergence Culture? Or just a stupid and boring hive mind? Either way, I think it has interesting implications for pedagogy.

Digital history: graduate seminars (1, 2, 3) and a bunch of new bloggers.

Google Earth added historical maps.

Josh Greenberg decided to leave his blog Epistemographer fallow, with good results. Meanwhile, Tom Scheinfeldt highlighted the "unintentional, unconventional and amateur" in Found History. Emma Tonkin argued that Folksonomies are just plain-text tagging under a new name.

Niall Ferguson wrote about historical gaming and simulation in New York magazine; Esther MacCallum-Stewart (Break of Day in the Trenches) and Gavin Robinson (Investigations of a Dog) extended the discussion in interesting new directions. At the end of the year, I was musing about alternate reality games.

Sheila Brennan began an ongoing survey of online History museums at Relaxing on the Trail.

I is for Information Aesthetics.

(Just realized why this kind of thing can be tiresome.)

Keyword in context (KWIC) is easy to implement, but Jeffrey Garrett argued that it doesn't really fit with the way humanists think.

Brian Hayes discussed the search for the one, true programming Language in American Scientist. It's LISP.

Dan Cohen noted that Machines are often the audience these days. He also wrote about data mining, provided an example in his blog, and warned that we shouldn't allow available tools to guide our inquiry. I spent much of the year developing tools to mine the Dictionary of Canadian Biography and other online repositories.

In September, Google released N-gram data to researchers. I argued that it could be put to use in digital history, and might help to change our notions of plagiarism.

O is for old. Just when you thought it was the new new, it turned out to be weird weird.

Manan Ahmed penned a Polyglot Manifesto in two parts. I switched from Perl to Python.

Q is for the <q> element, of course. Paula Petrik guided historians through the ins and outs of (X)HTML and CSS at HistoryTalk.

Choudhury and colleagues demonstrated a new tool for document recognition.

Exactly one year ago, in my first substantive post to this blog I suggested that historians be taught to search, spider and scrape. I still think it's necessary. Re: searching, Dan Cohen discussed the appeal of the single box, and Phil Bradley wrote about the past and future of search engines.

Tags: Mills Kelly thought they could be used to subvert the archive. TEI: Keith Alexander talked about authoring directly in it. Timeline: the SIMILE group released a "DHTML-based AJAXy widget."

Alan MacEachern started a series called "the Academic Alphabet" in University Affairs: "A is for admissions," "B is for books," and so on. Although he occasionally talks about digital history, he won't get to U until 2008. Good luck, buddy.

Tim Burke wrote about the history of virtual worlds in Easily Distracted.

W is for Wikipedia. Roy Rosenzweig wrote about whether history can be open source; Joseph Reagle blogged about his dissertation project on Wikipedia; Mills Kelly thought up classroom assignments; John Jordan discussed national variants. The New Yorker and Atlantic ran stories, too.

I discovered the joys of XAMPP when I set up the new digital history server at Western. Jeff Barry wrote a similarly positive review at Endless Hybrids.

Yahoo! Term Extraction, a handy way to find keywords in content.

And finally, in September I got my first look at the cool new Zotero. Expect it to play a big role in digital history next year.


Wednesday, December 13, 2006

Pedagogy for Collective Intelligence

In my last post I suggested that digital history might be able to harness the power of what Pierre Lévy calls "collective intelligence," a group of self-selected, networked individuals who work together to solve problems that are far beyond the capacity of any one person. In such a setting, knowledge becomes valuable when it is shared, and the strength of the collective depends on the diversity of skills and information that individuals can bring to it. Theorists like Lévy, Henry Jenkins and others suggest that such collectives may already be flourishing in online fandom and massively multiplayer games. To take a single example that Jenkins discusses, when children play with elements from the multibillion-dollar transmedia franchise Pokémon, they enter a world that is too complicated for any child to understand by him or herself. "There are several hundred different Pokémon, each with multiple evolutionary forms and a complex set of rivalries and attachments," Jenkins writes. "There is no one text where one can go to get the information about these various species; rather, the child assembles what they know about the Pokémon from various media with the result that each child knows something his or her friends do not and thus has a chance to share this expertise with others" (Convergence Culture, 128). He continues

Children are being prepared to contribute to a more sophisticated knowledge culture. So far, our schools are still focused on generating autonomous learners; to seek information from others is still classified as cheating. Yet, in our adult lives, we are depending more and more on others to provide information we cannot possess ourselves. Our workplaces have become more collaborative; our political process has become more decentered; we are living more and more within knowledge cultures based on collective intelligence. Our schools are not teaching what it means to live and work in such knowledge communities, but popular culture may be doing so.


How do we go beyond the autonomous learner? I've only begun teaching, so I don't have any ready answers. For the past few years I've been experimenting with assignments that require all of the students to work on related topics and make use of online archives of sources. These assignments have been relatively successful: the students can share information with one another, talk over interpretations, and answer each others' questions. At the end of the day, however, they've had to go off on their own and write a paper to hand in ... mostly because I haven't figured out how to assign grades to a collective intelligence in a way that isn't going to get me fired.

Our public history MA program is different from my undergraduate classes because it stresses teamwork in a community context. The students do some individual work but they also contribute to museum exhibits, websites, walking tours and other kinds of joint project. This year, Alan and I have both noticed that there is a new level of cooperation and cohesiveness among the students. I'm sure he wouldn't put it this way, but I think they've been acting more as a collective intelligence. There are two things that we're doing differently. For one thing, the students had a digital history grad class this year where they read a lot about Web 2.0 stuff (tagging, social search, mashups, open source, the economy of reputation, and so on). So they've been exposed to an ideology of collective intelligence and some of the tools that can facilitate it. We're also making much more extensive use of online, collaborative software. The students contributed to a shared archive by digitizing sources, and designed their part of a museum exhibit using a wiki. They've also been blogging and responding to each other online.

I think there are three areas that need more thought. First, the tools for collaboration could be greatly improved. (Personally, I hope that an extended Zotero could serve as the basis for future interaction.) Second, we have to get beyond the idea of autonomous learners. I'm not sure how long it will take the academy to make this transition, but individual academics can surely get involved with projects that teach the skills needed for these new knowledge communities. Third, we have to try to open up academic collectives so that they can mesh with ones outside the academy.


Monday, December 11, 2006

Collective Intelligence and ARGs

With my digital history grad seminar done for the year, I find myself mulling over two things. One is the idea of histories of the future, the subject of my last post. The other is what the French theorist Pierre Lévy calls "collective intelligence." In Henry Jenkins's formulation, "None of us can know everything; each of us knows something; and we can put the pieces together if we pool our resources and combine our skills" (Convergence Culture). I use Jenkins's version of the idea because he develops it in a fascinating discussion of alternate reality gaming.

In 2001, a secret Microsoft team known as the Puppetmasters put together a new kind of game called the "Beast." The puzzle posed by the game narrative, which would be delivered in fragments via every medium the designers could think of, would require the cooperation of many players to solve. It eventually was solved by the Cloudmakers, a self-selected team of hundreds of players. Jenkins says, "From the start, the puzzles were too complex, the knowledge too esoteric, the universe too vast to be solved by any single player." (For more, see Sean Stewart's introduction to the game, an academic paper by Jane McGonigal, and Cloudmakers.org).

Setting aside the fact that the Beast sounds like it was a heck of a lot of fun, it seems to me that digital history could harness this kind of collective problem solving if framed in the right way. This is already the stuff of fiction. In Vernor Vinge's Rainbows End, for example, computing is ubiquitous, high school students take courses in "search and analysis" and collectives of intelligence analysts work together in swarm-like fashion. One of Jenkins's provocative claims is that games and other media are already teaching children to work together in this fashion to solve problems. "In a hunting culture," he says, "kids play with bows and arrows. In an information society, they play with information."


Friday, December 08, 2006

Histories of the Future

My grad seminar in digital history wrapped up this week with a discussion of "histories of the future." As I explained to the students, I was trying to capture three things with the (syntactically ambiguous) title. In a course about history and computing, I thought it might be nice if we discussed some readings on the history of computing. More generally, I was also responding to an intriguing essay collection by the same title, which looks at a few of the different ways that people in the past imagined what was to come. And finally, I also wanted to provide a space to talk about how history may come to be done in the digital age.

As I've mentioned before, digital history is new enough that there's no real gap yet between the frontiers of research and classroom discussions. So for me, the idea of "histories of the future" represents a problem that I'm struggling with. Many of the histories that are being written right now don't really reflect the present, at least not my present. (In his wonderful Blessed Among Nations, Eric Rauchway uses the metaphor of an eyeglass prescription that no longer makes things clear.) I have the sense that if I could only figure out how to relate the two senses of the phrase "histories of the future," I'd know what history should look like in the present.


Sunday, November 26, 2006

The Difference That Makes a Difference

I recently had a conversation with some colleagues about a PhD student in our program who is close to finishing his dissertation on a 20th-century topic. One of them expressed concern that he might be embarrassed if he didn't consult a particular set of sources, that it would suggest that his research wasn't exhaustive enough. I was reminded of a conversation that I had with my supervisor Harriet Ritvo when I was first starting my own doctoral research.

"How do you know when to stop doing research," I asked her, "Don't you keep finding new sources?" "Of course," she said. "You always find new material. Your research is done when it stops making a difference to your interpretation." Since then, I've found that a number of historians that I admire have a similarly pragmatic criterion. In The Landscape of History, for example, John Gaddis writes, "Some years ago I asked the great global historian William H. McNeill to explain his method of writing history to a group of social, physical and biological scientists attending a conference I'd organized. He at first resisted doing this, claiming that he had no particular method. When pressed, though, he described it as follows:

I get curious about a problem and start reading up on it. What I read causes me to redefine the problem. Redefining the problem causes me to shift the direction of what I'm reading. That in turn further reshapes the problem, which further redirects the reading. I go back and forth like this until it feels right, then I write it up and ship it off to the publisher. (p.48)


If the idea of exhaustive or complete research ever made sense, it doesn't any longer. Around the time that I asked Harriet about how I would know when I had finished my research, I also did a Google search for "Chilcotin," the place I was writing about. I got about 2,000 hits and looked at each one of them. If I were to start the same project today I would find far too much material to wade through. (Google now indexes 543,000 pages that mention the Chilcotin.) More to the point, new material is spidered by Google at an exponential rate, whereas my ability to read through it is increasing in a linear fashion at best. By the time I finished the study there would be far more unread material than when I started it.

One of the consequences of the infinite archive (or of what Roy Rosenzweig calls the "culture of abundance") is that we can't wait to run out of sources before ending a line of inquiry. We can't even pretend to do so. Instead, we have to focus on how much new information we are getting, on average, from what we're learning. Following the work of CS Peirce and William James, Gregory Bateson famously defined information as "the difference that makes a difference." When it stops making a difference, it's no longer information.


Wednesday, November 15, 2006

In the Trading Zone

I'm afraid I haven't posted much this month because I've been working on a big grant application with a number of colleagues. One component of the grant is dedicated to digital infrastructure for environmental history, and as part of the grant-writing process I've been required to negotiate a number of partnerships with institutions that are creating digital repositories of various sorts. In an odd reversal, I've been finding it much easier to talk to librarians, archivists, curators and new media specialists than to my own colleagues in environmental history. I now realize that this is because I share a language with the former that the latter have not yet adopted: the language of open source, open access, digital libraries, markup, and web services.

Two other recent experiences have helped me put this in perspective. Yesterday in my digital history grad class we were talking about machine learning and data mining. Some of the students wondered whether we would let available tools and techniques guide research questions, basically agreeing with a concern raised by Dan Cohen earlier in the week. I was reminded of Maslow's remark that "If you only have a hammer, you tend to see every problem as a nail" (The Psychology of Science: A Reconnaissance, New York 1966). One of the ways that historians often differ from more theory-driven social scientists is by commitments to holism, nuanced messiness and complexity. Exceptions don't prove the rule; they show that rules are always impoverished.

I've also been reading Fred Turner's new book From Counterculture to Cyberculture (Chicago 2006). Talking about the emergence of an interdisciplinary group of scientists, engineers, and entrepreneurs, he draws on ideas from Peter Galison ('trading zone') and Geoffrey Bowker ('legitimacy exchange'):

Legitimacy exchange helped transform cybernetics from a relatively local contact language suited to the particular needs of scientists in wartime Cambridge into a discourse commonly used for coordinating work across multiple research projects and multiple professional communities. As Bowker suggests, cybernetics facilitated not only the interlinking of research, development, and production activities, but also the development of new interpersonal and interinstitutional networks and, with them, the exchange and generation of a networked form of power. To the extent that members of two or several disciplines could succeed in creating a relatively closed system of interlegitimation, they could make it extraordinarily difficult for nonexperts (i.e., noncyberneticians) to challenge their individual agendas. They could and did stake claims for research funding, material resources, and popular attention. Working together, in pairs and networks, each acquired a legitimacy that none could have had alone without the exchange of legitimacy afforded by cybernetic rhetoric (pp. 25-26).


Galison's ethnographic metaphor of a trading zone leaves room for the very different understandings that humanities scholars and intelligence analysts will bring to their encounter. In Image and Logic he writes, "Anthropologists are familiar with different cultures encountering one another through trade, even when the significance of the objects traded--and of the trade itself--may be utterly different for the two sides. And with the anthropologists, it is crucial to note that nothing in the notion of trade presupposes some universal notion of a neutral currency. Quite the opposite, much of the interest of the category of trade is that things can be coordinated (what goes with what) without reference to some external gauge" (p. 803). So much better to enter the trading zone than to wait and form a cargo cult.


Thursday, November 09, 2006

Wikis and Philological Criticism

In "Digital Maoism: The Hazards of the New Online Collectivism," Jaron Lanier recently wrote that "Reading a Wikipedia entry is like reading the bible closely. There are faint traces of the voices of various anonymous authors and editors, though it is impossible to be sure." Among other things, Lanier was arguing against the idea of the collective as "all-wise," something that Wikipedia's many other detractors have also tended to concentrate on. What most of these discussions miss is a feature of Wikipedia and other wikis that is, to my mind at least, what makes them so interesting: wikis automatically generate and maintain an extensive philological apparatus that is always available to the user.

It's fitting that Lanier should refer to close reading of the bible. After all, generations of textual scholars honed their critical skills on religious and classical texts, using exactly such 'faint traces' to determine the origin and authorship of sources, relate variants to one another and repair corruption. Such criticism, sometimes known as 'external' or 'lower' criticism was prelude to 'internal' or 'higher' criticism which "render[ed] a verdict ... on the source's significance as historical evidence" [Ritter, Dictionary of Concepts in History, s.v. "Criticism"; see also Greetham's excellent Textual Scholarship.]

Wiki software automatically tracks every single change made to a given page and lists the edits on a corresponding history page. For contested encyclopedia articles where there may be thousands of edits, it is possible to use sophisticated visualization tools like History Flow to study insertions, deletions, rearrangements, authorship, edit wars, and so on. Furthermore, every wiki page is also accompanied by a discussion page where authors and readers can annotate the texts. Both of these features are very valuable for close reading; neither is available in any of the more traditional encyclopedias that I'm familiar with.
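For anyone who wants to get at those traces programmatically, here is a minimal sketch in Python. It assumes the MediaWiki web API is available and that the endpoint and parameter names are as I remember them; treat both as assumptions rather than documentation.

import json
import urllib.parse
import urllib.request

def revision_history(title, limit=20):
    # Endpoint and parameter names are my assumption about the MediaWiki API.
    params = urllib.parse.urlencode({
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvprop': 'timestamp|user|comment',
        'rvlimit': limit,
        'format': 'json',
    })
    url = 'https://en.wikipedia.org/w/api.php?' + params
    req = urllib.request.Request(url, headers={'User-Agent': 'history-page-sketch'})
    with urllib.request.urlopen(req) as response:
        data = json.load(response)
    page = next(iter(data['query']['pages'].values()))
    return page.get('revisions', [])

# Print who changed the article, when, and what they said about the change.
for rev in revision_history('Historiography'):
    print(rev['timestamp'], rev.get('user', '?'), rev.get('comment', ''))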

Far too much emphasis has been placed on the content of Wikipedia, and not nearly enough on the practices of reading that it supports. Our students should be using wikis to learn the philological underpinnings of their craft, not told to avoid them because they are a 'bad' source.


Sunday, October 29, 2006

The Spectrum from Mining to Markup

In a series of earlier posts I've shown that simple text and data mining techniques can be used to extract information from a sample historical source, the online Dictionary of Canadian Biography. With such techniques it is possible to cluster related biographies, to try to determine overall themes or to extract items of information like names, dates and places. Information extraction can be particularly tricky because natural languages are ambiguous. In the DCB, for example, 'Mary' might be a person's name, the name of a ship or the Virgin Mary; 'Champlain' might be the person or any number of geographical features named for him. To some extent these can be disambiguated by clever use of context: 'the Mary' is probably a ship, 'Lake Champlain' is a place (although in a phrase like 'the lake Champlain visited' the word 'Champlain' refers to the person), and so on.

In order to make information explicit in electronic texts, human analysts can add a layer of markup. These tags can then be used to automate processing. I've recently begun a project to tag names and dates in Volume 1 of the DCB using the Text Encoding Initiative XML-based standard TEI Lite. These tags explicitly disambiguate different uses of the same word:

his wife <name type="person" reg="Abraham, Mary">Mary</name>
versus
boarded the <name type="ship">Mary</name>

the lake <name type="person" key="34237" reg="Champlain, Samuel de">Champlain</name> visited
versus
visited <name type="place">Lake Champlain</name>

Tags can also be used to add information that is clear to the reader but would be missed during machine processing. When the biography of John Abraham refers to his wife Mary, the person marking up the text can add the information that the person meant is "Abraham, Mary" and not, say, "Silver, Mary." In the case of someone like Champlain who has a biography in the DCB, the person's unique identifier can also be added to the tag. The information that is added in tags can be particularly valuable when marking up dates, as shown below.

<date value="1690">1690</date>
<date value="1690-09">September of the same year</date>
<date value="1690-09-12">twelfth of the month</date>
<date value="1690-06/1690-08" certainty="approx">summer of 1690</date>

In a later pass, my research assistants and I will add latitude and longitude to place name tags. For now, we are concentrating on clarifying dates and disambiguating proper nouns. So we are tagging the names of people ('Champlain'), places ('Lake Champlain'), ships ('the Diligence'), events ('third Anglo-Dutch War'), institutions ('Hudson's Bay Company'), ethnonyms ('the French') and others.

Given texts marked up this way, the next step is to write programs that can make use of the tags. In the Python Cookbook, Paul Prescod writes

Python and XML are perfect complements. XML is an open standards way of exchanging information. Python is an open source language that processes the information. Python excels at text processing and at handling complicated data structures. XML is text based and is, above all, a way of exchanging complicated data structures.


In future posts, I will introduce some of the python code that we are using to process the marked up DCB entries. In the meantime I can suggest a few of the many different kinds of questions that can be answered with these texts:

  • Are discussions of particular ethnic groups limited to ranges of time? Do Basques, for example, play a walk-on part in the early cod fishery only to more-or-less disappear from the story of Canadian history after that?

  • If you start with a particular person and link to all of the people mentioned in his or her biography, and then link to all of the people mentioned in theirs, do you eventually connect to everyone in Volume 1? In other words, is it a small world?

  • If you start with a particular place and time (say Trois-Rivières in 1660) and search for all of the events that happened in the preceding decade within a 50km radius, are they related? If so, how?


The classicist and digital humanist Gregory Crane has recently written that "Already the books in a digital library are beginning to read one another and to confer among themselves before creating a new synthetic document for review by their human readers" [What Do You Do with a Million Books?] This magic is accomplished, in part, by markup. If the system knows which 'Mary' is meant in a particular text it is quite easy to provide links to the same person (or saint, or ship) in other documents in the same digital collection. At the moment we are adding these links by hand, but it is easy to imagine building a system that uses text mining to assign preliminary tags, allows a human analyst to provide correction, then uses that feedback to learn. The Gamera project already provides a framework like this for doing OCR on texts of historical interest.
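Here is a rough sketch of what that kind of linking looks like in practice. The filenames are hypothetical, and the tag and attribute names simply follow the examples above: the script walks a set of marked-up entries with Python's ElementTree and builds an index from each regularized name to the entries that mention it.

import xml.etree.ElementTree as ET
from collections import defaultdict
from glob import glob

def index_names(paths):
    """Map (type, regularized name) -> set of entries that mention it."""
    index = defaultdict(set)
    for path in paths:
        root = ET.parse(path).getroot()
        for name in root.iter('name'):
            kind = name.get('type', 'unknown')
            # Prefer the regularized form; fall back to the tagged text itself.
            label = name.get('reg') or ''.join(name.itertext()).strip()
            index[(kind, label)].add(path)
    return index

if __name__ == '__main__':
    # 'dcb_volume1/*.xml' is a placeholder for wherever the tagged entries live.
    index = index_names(glob('dcb_volume1/*.xml'))
    for (kind, label), docs in sorted(index.items()):
        print(kind, '|', label, '|', len(docs), 'entries')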


Sunday, October 15, 2006

Behind the Scenes of a Digital History Site

In a thoughtful post about doing digital history, Josh Greenberg wrote

On an abstract level, I think that there’s a tension between making tools and using tools that comes from a deeper question of audience. When you’re using a tool (or hacking a tool that someone else has already built), there’s a singleminded focus on your own purpose – there’s an end that you want to achieve, and you reach for whatever’s at hand that will (sometimes with a little adjustment) help you get there. When trying to build a tool, on the other hand, there’s a fundamental shift in orientation – rather than only thinking about your own intentions, you have to think about your users and anticipate their needs and desires.


As Josh noted, I've tended to focus on using tools and hacking them in this blog. I haven't been particularly concerned to provide an overall theory of digital history, or even enough background that I could assume that each post would be accessible to everyone in the same way. (I guess the style reflects my own history with Lisp/Scheme and the Unix toolbox). For his part, Josh has been helping to build Zotero, a tool that shows that his concern with the needs and desires of users isn't misplaced.

At a still different level, there is the work that goes into making and maintaining a great digital history site. Dan Cohen and Roy Rosenzweig's book Digital History is an excellent introduction to this part of the field, as is the work that Brian Downey has been doing this year. Brian is the webmaster of the American Civil War site Antietam on the Web. AOTW has all kinds of nice features: an about page that explains their stance on copyright, the site's Creative Commons license and the privacy implications of the site monitoring that they do; an overview of the battle of Antietam with beautiful maps; a timeline that uses the SIMILE API; a database of participants in the battle; transcripts of official reports; a gallery of images; and dozens of other neat things.

At 10 years of age, AOTW is an obvious labor of love and a source of ideas and inspiration. Since March of this year, however, Brian has also been blogging at behind AOTW, "the backwash of a digital history project". The combination of the AOTW site and Brian's blog provides the student of digital history with an unparalleled view behind the scenes of a successful project. In March, for example, Brian posted about footnotes in online history, allowing the reader to compare his code with the implementation on the AOTW site. In another post that month, he discussed copyright and the public domain, something that he has a more-than-academic interest in. In April he laid out a top-down strategy for practicing digital history, continued in June. In July, he discussed the question of whether a site should host advertisements in "Pimping the History Web?" and reviewed some 19th-century online works from the Perseus Project. In August, he implemented a timeline widget and gazetteer for AOTW. This month he has a series of great posts to help someone get started without "an IT shop or a CHNM": tools for putting history online, PHP+database+webserver and jumping in with both feet.


Thursday, October 12, 2006

Searching for History

In August 2006, AOL released three months' worth of search data for more than half a million of their users, each represented by a random ID number. Within days, the company realized that this was a mistake, withdrew the data and made a public apology. (If you missed the story you can find background information and news articles here.) Many people created copies of the dataset before it was withdrawn and it is still available for download at various mirror sites on the web. Part of the uproar was due to the fact that people had used information like credit card and social security numbers in their searches; in one well-publicized case, a woman was actually identified by the content of her searches.

The AOL researchers intended the data to be used for research purposes, and, in fact, it contains a wealth of information about everyday historical consciousness that is useful for public historians. With the proper tools, the AOL search data can be easily mined to discover what kinds of historical topics people are interested in and how they go about trying to find them. We can then use that information to shape the architecture of our online sites. The results presented below were generated in a couple of hours with off-the-shelf tools.

The AOL data are distributed as a compressed archive which uncompresses to 10 text files totalling about 2.12 GB. I used a program called TextPipe Pro to extract all of the searches with 'history' in them. I then loaded these into Concordance, another commercial program, to do the text analysis. True to its name, Concordance lets you create concordances and tables of collocations. (Readers of this blog will know that both of these tasks could be easily accomplished with a programming language like Python, but I wanted to show that you don't have to be able to program to do simple data mining.) The process of extracting the searches and creating a concordance for the 57,291 tokens of 'history' was very fast. It took less than five minutes on a not-very-expensive desktop computer running Win XP.
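For readers who would rather script it, here is a minimal sketch of the same extraction in Python. It assumes the uncompressed AOL files are tab-delimited with the query in the second column and a header on the first line; the filename pattern is hypothetical.

from collections import Counter
from glob import glob

left_of_history = Counter()
history_tokens = 0

# The filename pattern is a placeholder; point it at wherever you unpacked the data.
for path in glob('AOL-user-ct-collection/user-ct-test-collection-*.txt'):
    with open(path, encoding='utf-8', errors='replace') as f:
        next(f)  # skip the (assumed) header line
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 2:
                continue
            words = fields[1].lower().split()
            history_tokens += words.count('history')
            # Tally the word immediately to the left of 'history'.
            for i, word in enumerate(words):
                if word == 'history' and i > 0:
                    left_of_history[words[i - 1]] += 1

print(history_tokens, "tokens of 'history'")
for word, count in left_of_history.most_common(20):
    print(count, word)

Sorting those counts gives you, in effect, one column of the collocation table near the end of this post.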

Given a concordance, we are in a position to explore what kinds of searches include the word 'history'. For example, suppose someone is interested in US History. They could frame their search in many ways: 'American history', 'history of the United States', and so on. If you are trying to reach users with an online history site, you want to know what kinds of searches they are going to use to get to you. The table below shows the various possibilities that were used by AOL searchers more than fifty times. (Note that I don't include searches for individual states, that the phrase 'American history' is a substring of other phrases like 'African American history' and 'Latin American history', and that the concordance program allows us to search for collocations separated by intervening words.)

american history 998
us history 379
american X history 99
history X american 92
united X history 85
states history 83
history X X america 78
us X history 67
american X X history 63
america history 62


These data seem to indicate a fairly strong preference for the adjectival form. People, in other words, prefer to think of the subject as American or US History rather than the History of the US or of America. The AOL data provide stronger evidence for this search than for most others, but the pattern appears in other regional or national contexts. For example, 'european history' (67) vs. 'history of europe' (3), 'chinese history' (32) vs. 'history of china' (17). More work would obviously be needed to make any kind of strong claim. And some thematic subjects show the opposite pattern, e.g., 'technology history' (4) vs. 'history of technology' (10).

Digging in the search data reveals some unexpected patterns. Some people search for historical topics using a possessive like 'alaska's history' (11), 'canada's history' (5), or 'china's history' (2). When I was adding meta tags to our History Department website, this one never occurred to me but it makes sense in retrospect. If you sort the data by the right context (the stuff that appears after the word 'history') you also find that many people are trying to use dates to limit their searches in a way that most search engines don't allow.

england history 1350 to 1850
french history 1400's
world history 1500 through 1850
world history 1500-1750
women history 1620-1776
italian history 1750's
ancient history 1735 bc
russian history 1880-1900
texas history 1890s law
texas history 1900s news
salvadoran history 1980s
east harlem history 19th century


Unfortunately, searching for '1400's' won't yield dates in the range 1400-1499; it will merely match the literal string '1400's'. Likewise, searching for '1350 to 1850' will only return pages that have '1350' or '1850' in them. Searching for '19th century' will give better results but still miss many relevant documents. I hope that the companies working on search engines have noticed that people want to do these kinds of searches; supporting them would make the web much more useful for historical research.
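In the meantime, this kind of date-range filtering is easy enough to approximate once you have documents in hand. Here is a minimal sketch in Python; the sample text and the ranges are made up for illustration.

import re

def years_in_range(text, start, end):
    """Return the four-digit years mentioned in text that fall within [start, end]."""
    years = (int(y) for y in re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text))
    return sorted(y for y in years if start <= y <= end)

sample = "The rush to the Cariboo began in 1858 and had faded by 1865."
print(years_in_range(sample, 1848, 1855))   # []
print(years_in_range(sample, 1858, 1865))   # [1858, 1865]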

The prepositional form really comes into its own for more idiosyncratic searches. Apparently people want to know the histories of

1892 carlsbad austria china teapots
a and w root beer
acne
alfredo sauce
banoffee pie
bingham hill cemetery
blood gangs and hand shakes
celtic body art
chakras
coleus plant
dental hygiene in america
do rags
easter egg hunt
emlen physick
everything video game releated
family feud
fat tuesday
girls sweet sixteen birthdays
gorzkie zale
half pipe in snowboarding
hex nuts
impala ss
irrational numbers
jang bo go
jay-z
k9 german sheppards
kissing
l'eggs hosiery
laminated dough
macho man
motion offense in basketball
myspace
nalgene
national pi day
oreos
paper marbling
patzcuaro
quad rugby
resident evil
residential wiring systems
salads
shaving legs
toilet paper
trolls
tv dinners
ultrasound
using wine
v8 juice
vikings ruins in okla
watts towers
wifebeaters
xbox
yankee doodle dandy
zombie movies


There's no adjectival form for these. The history of sport may be interesting to sport historians, but to whom is the history of hex nuts interesting? More people than you'd think.

Finally, we can also see the AOL users' concern with privacy in the data. The concordance software allows us to see which words appear most frequently 1 position to the left of 'history' and 1 position to the right, 2 positions to the left and 2 positions to the right, and so on. The left collocations are most informative in this case. We find, for example, that 'clear' is the third most frequently appearing word 1 position to the left of 'history'. Fourteen hundred and thirteen different searches included the phrase 'clear history', 563 the phrase 'delete history' and 208 'erase history'. If we look 2 positions to the left, we find more searches with similar intent 'clear X history' (510), 'delete X history' (353) and 'erase X history' (140). Ironically, many of these searches also include 'AOL' as a collocate, e.g., 'how do i delete my history on aol' or 'remove my history from aol'. The table below summarizes these collocations.

3 to the left     2 to the left     1 to the left
                  my 199            my 1836           history
clear 123         clear 510         clear 1413        history
aol 41            aol 238           aol 668           history
delete 71         delete 353        delete 563        history
                                    search 382        history
                  erase 140         erase 208         history
                                    browser 132       history

Of course, this is merely the tip of the iceberg. Much more could be found by studying, for example, how people use dates in searches or what kinds of things they are looking for when they visit particular websites.


Tuesday, October 10, 2006

Tapera-DHH Survey of History Blogs

I've been working with Nicolás Quiroga of Tapera on a survey of history blogs. (To be fair, Nicolás has actually been doing most of the work.) Anyway, he has created a number of graphs of the preliminary results and posted them to his blog [1, 2, 3]. We also have a wiki page for the project on the Western Digital History server. The wiki is set up so that anyone can read it, but you need an account to edit it. I'm happy to provide access to the wiki to other digital historians who are interested in playing with or extending the results. If you'd like to participate in the ongoing blog survey, please mail your answers to the questions to Nicolás at tapera@tapera.info.

Friday, October 06, 2006

Zotero Beta Launched

In previous posts I've discussed the great new research tool Zotero [1, 2]. The public beta of the software launched yesterday, with a new website, a blog, user forums and greatly extended documentation including a wiki for developers. Zotero's creators have been busy in the few weeks since I reviewed the pre-release beta. They've added support for reusing tags, made it easier to add notes to saved sources and added a bunch of new fields to the bibliographic records. As before, the interface is clean and quite intuitive and the program works smoothly when you need it and doesn't get in your way when you don't. It's a beautiful piece of work.

Something I hadn't noticed before: Zotero uses the OpenURL framework to provide support for context-sensitive services. This means that you can tell the program to locate a source that you are interested in, and it will look for it in your local library.

The feature list gives you some idea of where Zotero is going (and where you can help take it). Planned features include shared collections, remote library backup, advanced search and data mining tools, a recommendation engine with RSS feeds and word processor integration. Zotero is already much more than bibliographic management software. It is a "platform for new forms of digital research that can be extended with other web tools and services." And it rocks.


Tuesday, October 03, 2006

On N-gram Data and Automated Plagiarism Checking

In August, Google announced that they would be releasing a massive amount of n-gram data at minimal cost (see "All Our N-gram are Belong to You").

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together.


In brief, an n-gram is simply a collocation of words that is n items long. "In brief" is a bigram, "a collocation of words" is a 4-gram, and so on. For more information, see my earlier post on "Google as Corpus."

The happy day is here. For US $150 you can order the six-DVD set of Google n-gram data from the Linguistic Data Consortium. While waiting for my copy to arrive, I figured that I could take this opportunity to suggest that the widespread availability of such data is going to force us to rethink the idea of plagiarism, especially the idea that plagiarism can be detected in a mechanical fashion.

My school, for example, subscribes to a service called Turnitin. On their website, Turnitin claims that their software "Instantly identifies papers containing unoriginal material." That's a pretty catchy phrase. So catchy, in fact, that it appears, mostly unquoted, in 338 different places on the web, usually in association with the Turnitin product, but also occasionally to describe their competitors like MyDropBox.

In the old days, say 2001, educators occasionally used Google to try and catch suspected plagiarizers. They would find a phrase that sounded anomalous in the student's written work and type it into Google to see if they could find an alternate source. I haven't heard anyone claim to have done that recently, for a pretty simple reason. Google now indexes too much text to make this a useful strategy.

Compared with Google, Turnitin is a mewling and puking infant (N.B. allusion, not plagiarism). At best, the company can only hope for the kind of comprehensive text archive that massive search engines have already indexed. With this increase in scale, however, comes a kind of chilling effect. Imagine if your word processor warned you whenever you tried to type a phrase that someone else had already thought of. You would never write again. (Dang! That sentence has already been used 343 times. And I know that I read an essay by someone on exactly this point, but for the life of me I can't locate it to cite it.)

What Google's n-gram data will show is that it is exceedingly difficult to write a passage that doesn't include a previously-used n-gram. To demonstrate this, I wrote a short Python script that breaks a passage of text into 5-grams and submits each, in turn, to Google to make sure that it doesn't already appear somewhere on the internet.
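The splitting itself is trivial. Here is a minimal sketch of it; the hit-count lookup is left as a stub, because whatever you use to query Google (or, eventually, the LDC data) will be specific to your own setup.

import re

def ngrams(text, n=5):
    # Lowercase and strip punctuation, then slide a window of n words across the text.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def hit_count(phrase):
    """Stub: return how many times phrase has already been used elsewhere."""
    raise NotImplementedError

passage = ("Students must write their essays and assignments "
           "in their own words.")
for gram in ngrams(passage):
    print(gram)          # replace with: print(hit_count(gram), gram)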

My university's Handbook of Academic and Scholarship Policy includes the following statement, which provides a handy test case.

NOTE: The following statement on Plagiarism should be added to course outlines:
“Plagiarism: Students must write their essays and assignments in their own words. Whenever students take an idea, or a passage from another author, they must acknowledge their debt both by using quotation marks where appropriate and by proper referencing such as footnotes or citations. Plagiarism is a major academic offence (see Scholastic Offence Policy in the Western Academic Calendar).”


Here are the number of times that various 5-grams in this statement have been used on the web, sorted by frequency:

5740 "should be added to course"
1530 "idea or a passage from"
1480 "assignments in their own words"
1400 "where appropriate and by proper"
1380 "or a passage from another"
1270 "an idea or a passage"
1120 "and assignments in their own"
0923 "plagiarism is a major academic"
0774 "a passage from another author"
0769 "essays and assignments in their"
0704 "students must write their essays"
0635 "they must acknowledge their debt"
0628 "must write their essays and"
0619 "write their essays and assignments"
0619 "marks where appropriate and by"
0606 "acknowledge their debt both by"
0605 "is a major academic offence"
0596 "both by using quotation marks"
0595 "appropriate and by proper referencing"
0588 "policy in the western academic"
0585 "and by proper referencing such"
0585 "referencing such as footnotes or"
0585 "scholastic offence policy in the"
0583 "must acknowledge their debt both"
0579 "by using quotation marks where"
0573 "such as footnotes or citations"
0572 "proper referencing such as footnotes"
0570 "using quotation marks where appropriate"
0561 "their debt both by using"
0553 "take an idea or a"
0549 "debt both by using quotation"
0549 "in the western academic calendar"
0548 "see scholastic offence policy in"
0546 "offence policy in the western"
0544 "quotation marks where appropriate and"
0503 "by proper referencing such as"
0492 "their essays and assignments in"
0490 "note the following statement on"
0479 "in their own words whenever"
0453 "whenever students take an idea"
0452 "from another author they must"
0442 "students take an idea or"
0432 "another author they must acknowledge"
0389 "citations plagiarism is a major"
0385 "their own words whenever students"
0377 "passage from another author they"
0373 "own words whenever students take"
0368 "or citations plagiarism is a"
0366 "footnotes or citations plagiarism is"
0366 "a major academic offence see"
0355 "as footnotes or citations plagiarism"
0353 "the following statement on plagiarism"
0348 "major academic offence see scholastic"
0338 "offence see scholastic offence policy"
0333 "academic offence see scholastic offence"
0179 "plagiarism students must write their"
0096 "plagiarism should be added to"
0066 "following statement on plagiarism should"
0062 "be added to course outlines"
0033 "statement on plagiarism should be"
0030 "on plagiarism should be added"


Beyond the mechanical, there are a lot of murky conceptual problems with plagiarism. To claim that the core value of scholarship has always been to respect the property rights of the individual author is wildly anachronistic. (For a more nuanced view, see Anthony Grafton's Forgers and Critics and Defenders of the Text.) A simpleminded notion of plagiarism also makes it difficult to explain any number of phenomena we find in the actual (as opposed to normative) world of text: Shakespeare, legal boilerplate, folktales, oral tradition, literary allusions, urgent e-mails about Nigerian banking opportunities and phrases like "all our n-gram are belong to you."

In a 2003 article in the AHR, Roy Rosenzweig wrote about the difficulties that historians and other scholars will face as they move from a culture of scarcity to one of abundance. In many ways, this transition has already occurred. It's time to stop pretending that prose must always be unique, or that n-grams can be property. All your prose are belong to us.


Thursday, September 28, 2006

No 'Secret Syllabus' for Digital History

My colleague Rob MacDougall recently suggested that we teach to two syllabi, the one we give to students and a "secret" one:

Every course we teach has two syllabi, I think. There’s the visible one, the actual list of readings and topics we assign to our students. And then there’s the secret syllabus, made up of whatever assortment of books and articles we also happen to be reading while teaching the course. These are the various bees and bats in our belfries and bonnets, the things we’re chewing on as we walk into the classroom, the new interpretations and the rediscovered classics that get us fired up about a topic we may have taught several times before.


It's a fun observation and it rings true for me. It explains the asides that I bring to my survey lectures and the discussions that I have with students about the discrepancies between what I want to talk about in class and what it says in the text. When I talk about the fur trade, I have to make sure that I talk about the staples thesis, but I am really fired up about the kinds of questions that Carolyn Podruchny has been asking. Why did voyageurs practice a mock baptismal rite but not, say, a mock communion? What does this tell us about how they understood space and place? What was the connection between the aboriginal windigo and the European werewolf? (For these and much more, see her wonderful forthcoming book Making the Voyageur World). When I talk about the Jesuits, I have to make sure that they know about Paul Le Jeune and the Jesuit Relations. But in the back of my mind I'm thinking of Peter Goddard's discussion of the degree to which the Jesuits of New France actually believed in the agency of demons.

But not every course has a secret syllabus. Notably, I don't have one for my digital history grad class. The things that I put in the syllabus this year are exactly the things I am struggling with right now. In 20 years, perhaps, when digital history is an established field with a hundred visions and revisions ... maybe then I will feel the impatient tension between what I need to tell them and what I want to tell them. But for now all we have are the visions. When one of my tech-savvy students tells me in a bemused way that he has never felt so lost in all his life, I can agree. Me either. Right now digital history is an exploration. We don't know what we're going to find. We don't even have a map, never mind a secret one. That is why it is such a great time to become a digital historian.


Sunday, September 24, 2006

Student Reflections on Digital History

The autumn term is now under way and our students are back in school. Those taking grad courses in digital history at The University of Western Ontario and George Mason University are doing reflective blogging as part of their coursework. While not exactly the same, the two courses cover a lot of the same ground, and the students are wrestling with many of the same issues. Their blogs make for very interesting reading, and Josh Greenberg (who is teaching 696) and I hope that there will be opportunities for members of the two classes to interact with one another. So by way of introduction, History 513F, meet History 696.

Hist513F
Bryan Andrachuk
Lauren Burger
Diana Dicklich
John Jordan
Kelly Lewis
Molly Macdonald
Adam Marcotte
Carling Marshall
Kevin Marshall
Jeremy Sandor

Hist696
Bill Andrews
Amanda Bennett
Jeff Bowers
James Garber
Misha Griffith
Karin Hill
Thomas Jenkins
John Lillard
Jenny Reeder
Steven Scott
Jennifer Skomer
Dieter Stenger
Tad Suiter
Karen Tessier
Billy Wade
Alan Walker
Gwen White

Update 29 Sep 2006. My friend Mills Kelly notified me that there are actually two grad digital history courses at GMU this semester. He is teaching History 689: Teaching and Learning in the Digital Age, and his students are blogging too. Welcome!


Tuesday, September 19, 2006

Extending Zotero for Collaborative Work

In an earlier post I reviewed the forthcoming open-source Firefox extension Zotero. In brief, Zotero is able to automatically extract citation information from a web page that you are browsing and store it in a database. It also lets you organize and search through your research notes and do a number of other useful things. Since it is open source, users are free to develop the software and add new features. In my review I suggested a few features that could be added, such as support for RSS feed aggregation and spidering / text mining within the browser.

Here I'd like to speculate a little bit more about the kinds of things that Zotero could be used for, this time concentrating on scholarly collaboration. In the version that I reviewed, Zotero stores citation information in a local SQLite database. It also allows you to import and export information in a variety of XML-based standard forms. Putting these two features together, it should be straightforward to send XML records over the web so that they could be stored in a nonlocal SQL database. Imagine, then, that in the Zotero collections panel you could have nonlocal folders that were automatically synchronized to shared external databases. You could subscribe to a bibliographic community with a particular interest and receive citations to sources as well as contribute your own. Such communities might form around a particular project (e.g., the references to be cited in an edited volume or jointly-authored textbook) or around a particular event (e.g., Hurricane Katrina) or an emerging research field (e.g., digital history). Since Zotero also allows you to work with research notes, PDFs, images, and other kinds of files it would be possible to synchronize most of the files associated with a particular project amongst collaborators. It would also be easy to flag information that a particular user had already seen, so that new items could be visually coded to draw attention. (In the Sage RSS feed aggregator, for example, feeds that haven't been updated are in normal font and those with new information are in boldface.)
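To make the synchronization idea a little more concrete, here is a minimal sketch. Everything in it is hypothetical (the export function, the folder name and the server URL are stand-ins); the point is only that records exported as XML can be pushed to a shared database with a plain HTTP request.

import urllib.request

def export_folder_as_xml(folder_name):
    """Stub: return the folder's records as an XML string (MODS, RDF, etc.)."""
    raise NotImplementedError

def push_to_shared_collection(folder_name, server_url):
    # POST the exported records to the shared collection's endpoint.
    payload = export_folder_as_xml(folder_name).encode('utf-8')
    req = urllib.request.Request(server_url, data=payload,
                                 headers={'Content-Type': 'application/xml'})
    with urllib.request.urlopen(req) as response:
        return response.status

if __name__ == '__main__':
    push_to_shared_collection('digital history',
                              'https://example.org/shared/collections/digital-history')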


Saturday, September 09, 2006

What We Need Now Is a Good Trolling Engine...

One thing that is difficult to do with a traditional search engine is find documents that were written at a particular time (the new Google News Archive Search being a notable exception). Suppose, for example, that you are starting a research project on the environmental history of nineteenth-century gold rushes in North America. Are there good collections of online primary sources that you should know about? Of course, but they can be hard to find. It would be great to be able to limit your Google searches to documents written during particular date ranges, e.g., 1848-55 (for the California gold rush), 1858-65 (Cariboo) or 1896-99 (Klondike).

This turns out to be more difficult than you might think at first. Google Advanced Book Search allows you to specify a publication date range. So a search for "gold california date:1848-1855" returns books like Walter Colton's Three Years in California (1850), which you can actually download as a PDF. But other books are not going to show up, like A Doctor's Gold Rush Journey to California by Israel S. Lord, which was written from 1849 to 1851 but not published until 1995. In cases like these, you are searching through metadata rather than through the document itself. Most of the material on the web doesn't have enough metadata to be really satisfactory for this kind of searching.

Furthermore, depending on the project you may not always have good search terms. Suppose you are thinking of becoming a digital medievalist and want to get some idea of what kinds of sources you might be able to work with. How do you search for machine-readable documents written in Old English? Obviously you will try to make use of the traditional scholarly apparatus and of online resource guides like The ORB.

To supplement this kind of activity, I'm thinking it would be very nice to have what I'm going to call a "trolling engine," a tool that can sift through the Internet on a more-or-less continuous basis and return items that match a particular set of criteria determined by a human analyst. You would set it up, say, to look for documents written during the Cariboo gold rush, or written in Old English around the time of King Alfred, or ones that may have been written by ornithologists in the West Midlands in the 1950s (if you're interested in the latter, you're in luck).

So how would a trolling engine work? In present-day search engines, spiders scour the web downloading pages. A massive inverted index is created so that there is a link running from each term on every page back to the page itself. Once this blog post is indexed by Google's spiders, for example, there will be links to it in their inverted index from "trolling," "engine," "spiders" and many other terms. The catch is that there is not a lot of other publicly-accessible information associated with each term. Suppose, however, that Google also tagged each term with its part of speech and parsed all the text in the surrounding context. Then you would be able to search for items in a particular syntactic frame. As Dan Brian showed in an interesting article, you could search for all instances of "rock" used as an intransitive verb, and find sentences like "John thought San Francisco rocked" without finding ones like "The earthquake rocked San Francisco." There is already a pretty cool program called The Linguist's Search Engine that lets you do this kind of searching over a corpus of about 3.5 million sentences.
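
Here is a minimal sketch of that kind of index in Python, nothing like Google's actual pipeline: each (word, part-of-speech) pair points back to the documents it appears in. It assumes NLTK with its default tokenizer and tagger data installed.

    from collections import defaultdict
    import nltk

    documents = {
        1: "John thought San Francisco rocked.",
        2: "The earthquake rocked San Francisco.",
    }

    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            index[(word.lower(), tag)].add(doc_id)

    # "rocked" as a past-tense verb (VBD) turns up in both sentences; telling the
    # intransitive use from the transitive one would take a parser, but a tagged
    # index is the first step.
    print(index[("rocked", "VBD")])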

In fact, being able to search the whole web for words in particular syntactic frames could be a very powerful historical tool for a simple reason: languages change over time. Take "sort of/kind of." For at least six hundred years, English speakers have been using these word sequences in phrases like "some kind of animal," that is, as a noun followed by a preposition. By the nineteenth century, "sort of" and "kind of" also appeared as degree modifiers: "I kind of think this is neat." In a 1994 Stanford dissertation, Whit Tabor showed that between the 16th and 19th centuries, "sort of" and "kind of" increasingly appeared in syntactic frames where either reading makes sense. That is, "kind of good idea" might be interpreted as [kind [of [good idea]]] or [[[kind of] good] idea]. So if you find a document that uses "sort of" or "kind of" as a degree modifier, you have one clue that it was probably written sometime after 1800. (See the discussion in Manning and Schütze for more on this example.)
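
As a rough illustration of how that clue could be operationalized, here is a sketch (again assuming NLTK) that flags "kind of" or "sort of" followed by an adjective, the degree-modifier reading that only became common after about 1800. It is only a heuristic: as the bracketings above show, some of these frames are genuinely ambiguous.

    import nltk

    def degree_modifier_hits(text):
        """Return 'kind of' / 'sort of' sequences followed by an adjective."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        hits = []
        for i in range(len(tagged) - 2):
            (w1, _), (w2, _), (w3, t3) = tagged[i], tagged[i + 1], tagged[i + 2]
            if w1.lower() in ("kind", "sort") and w2.lower() == "of" and t3.startswith("JJ"):
                hits.append(" ".join([w1, w2, w3]))
        return hits

    # Flags "kind of good" but not "kind of animal".
    print(degree_modifier_hits("That is kind of good, and some kind of animal."))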

It's not just these two word sequences that have a history. Every word, every collocation has a history. A word like "troll" is attested as a verb in the fourteenth century and as a noun in the seventeenth. Its use as a fishing term also dates from the seventeenth century. If your document is about trolls it was probably written after 1600; if it is about trolling, it could have been written earlier (see my post on "A Search Engine for 17th-Century Documents"). By itself, the earliest attested date of a single word or collocation is weak evidence. If we were to systematically extract this kind of information from a very large corpus of dated documents, however, we could create a composite portrait of documents written in AD 890 or during the Cariboo gold rush or at any other given time.
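
A toy version of that composite approach might look like the following sketch; the attestation table is invented for illustration, and a real one would have to be extracted from a large dated corpus.

    # Hypothetical earliest-attestation dates, for illustration only.
    earliest_attested = {
        "troll": 1300,    # the verb
        "trolls": 1600,   # the noun
        "klondike": 1896,
    }

    def earliest_plausible_date(tokens, table):
        """The latest 'first attested' date among a document's words gives a
        rough lower bound on when the document could have been written."""
        dates = [table[t] for t in tokens if t in table]
        return max(dates) if dates else None

    print(earliest_plausible_date(["the", "trolls", "troll"], earliest_attested))  # 1600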

A similar logic would help us find documents written by ornithologists. In this case, the training corpus would have to be tagged with a different kind of metadata in addition to the date: the occupation of the author. Once we had that we could discover that two words that appear separately on millions of web pages, "pair" and "nested", occur quite rarely as the collocation "pair nested." That's the kind of thing an ornithologist would write.
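
One simple way to quantify "the kind of thing an ornithologist would write" is to compare how often a collocation appears in documents known to be by ornithologists with how often it appears everywhere else. The sketch below uses a smoothed log odds ratio; all of the counts are made up.

    import math

    def log_odds(hits_a, total_a, hits_b, total_b):
        """Log odds ratio of a feature appearing in corpus A versus corpus B,
        with add-one smoothing so zero counts don't blow up."""
        p_a = (hits_a + 1) / (total_a + 2)
        p_b = (hits_b + 1) / (total_b + 2)
        return math.log((p_a / (1 - p_a)) / (p_b / (1 - p_b)))

    # e.g., "pair nested" in 40 of 1,000 ornithologist documents,
    # versus 5 of 100,000 other documents (invented numbers).
    print(log_odds(40, 1000, 5, 100000))

A strongly positive value means the collocation leans toward the ornithologist corpus, which is exactly the kind of signal a trolling engine could accumulate over many features.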


Thursday, September 07, 2006

A First Look at Zotero

Our school year officially started today but I'm not teaching on Thursdays this term, so I was able to spend the day hacking the pre-release beta of Zotero and listening to an album of pioneering electronic music. The music turned out to be the perfect complement to the software.

The basics. Zotero is the brainchild of a team of digital historians at the Center for History and New Media at George Mason University: Dan Cohen, Josh Greenberg, Simon Kornblith, David Norton and Dan Stillman. Their basic goal was to create a freely available, open source tool that would put the essential functions of standalone bibliography software like EndNote into the Firefox browser. Since we already spend most of the day reading and writing in our browsers (e-mail, blogging, newsfeeds, online journals, e-books, library catalogs, etc.), this makes a lot of sense. Like commercially available packages, Zotero allows you to create and cite from a database of primary and secondary references of various types (books, newspaper articles, journal articles, and so on). Instead of starting a separate program, however, you can enter records as you browse library catalogs (e.g., Library of Congress, WorldCat), bookstores (Amazon.com) and many other websites.

Zotero already has some distinct advantages over commercial bibliographic management software. For one thing, you can arrange your sources hierarchically. The interface is divided into three panels, which give you different views into your collections of sources, using the familiar file folder metaphor. The lefthand panel shows a top-level view of folders, the centre panel shows the contents of the currently selected folder, and the righthand panel shows a tabbed display of the details for the currently selected item. You can see a screenshot of the interface in the quick start guide. It is easy to customize the information presented in the middle panel. Zotero not only allows you to create bibliographic records, but also makes it easy to link to webpages, to snapshots of webpages, to other files like JPEGs and PDFs, and to notes which you can create directly in your browser. You can tag records with Library of Congress subject headings (LCSH) or with your own tags, or a mixture of the two. You can also link records within your collections to one another. (I have to admit that I haven't quite figured out a use for this.) The interface also easily toggles between views that take up all, some or none of the browser page. Finally, there is a feature called "smart collections" which lets you save the results of a search as a virtual folder. This is handy because it gives you different views of the same data without requiring you to enter it in multiple places.

Sensing citation information. Let's take it as read that Zotero is a great tool for keeping track of your bibliographical information without leaving the browser. But there's more. When you browse a page that has citation information embedded in it, Zotero "senses" that and lets you know. You then have the option of automatically scraping some or all of the data into your bibliographic database. The beta version already supports this behaviour at a number of widely used sites like the Library of Congress, WorldCat, Amazon.com and the New York Times. In my trial runs, it worked perfectly at the Library of Congress and Amazon, and with a few hiccups at a number of other sites. Since Zotero is extensible, expect to see user-contributed scrapers start to appear as soon as the beta is released. (More on this below.) In my own university's library catalog, I had to switch to MARC view, and then Zotero worked perfectly. But then, scrapers are notoriously brittle.

Hacking Zotero. Zotero exports to RDF/XML and imports from a number of XML-based standards (RDF, MARC, MODS and RIS). Since it is pretty easy to write programs to manipulate RDF/XML in high-level programming languages, it will be possible for digital historians to collect resources via browsing in Zotero, then automate the processing of those records. It will also be possible to write programs that collect raw data (e.g., spiders), do some processing and then write the output in a format that can be imported into Zotero and scanned by a human interpreter. In other words, your Zotero collection (or anyone else's, or a whole bunch of people's) can be part of a workflow that includes both people and machines. This will be very useful for text and data mining projects.
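
As a small example of what one step in such a workflow might look like, here is a sketch that pulls the titles out of a Zotero RDF/XML export so they can be handed off to another script. The file name is invented, and the sketch assumes that basic fields are expressed with Dublin Core elements (dc:title), which may not hold for every record type.

    import xml.etree.ElementTree as ET

    DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

    def list_titles(path):
        """Walk an RDF/XML export and return the text of every dc:title element."""
        tree = ET.parse(path)
        return [element.text for element in tree.iter(DC_TITLE) if element.text]

    if __name__ == "__main__":
        for title in list_titles("collection.rdf"):
            print(title)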

Behind the scenes. (This part of the review can be safely skipped if you aren't going to modify the program yourself). Since Zotero is open source, it is possible to look at the code and see how it works. Then hack it. Zotero installs two folders in your Firefox profile, one called "zotero" that holds your data, and one called "extensions/zotero@chnm.gmu.edu" that holds the source code. The former contains a SQLite database that Firefox (and thus Zotero) uses to hold client-side information. You can download a SQLite utility that allows you to interact with the tables WHEN YOU ARE NOT RUNNING FIREFOX. (Otherwise you run the risk of hosing your database.) With this utility you can enter commands like ".schema" to see the SQL statements needed to create the database, or "select * from tags" which shows you the tags you have already created. Modifications to the Zotero code can be done in a few places, notably the files "schema.sql" and "scrapers.sql". If you wanted to add record types to your database, for example, you'd have to modify the former. The scrapers are written in JavaScript and stored in the database. Presumably, the stable release of Zotero will include some tutorials showing how to write simple scrapers, but an intrepid programmer can probably figure it out from the supplied code. (And more. You can almost feel Kornblith's pain in one of his plaintive comments: "// Why can''t amazon use the same stylesheets".)
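
For those who would rather poke at the database from Python than from the sqlite3 shell, here is an equivalent sketch using the standard library. The file name is an example; work on a copy of the database, with Firefox closed, so there is no chance of hosing the real one.

    import sqlite3

    connection = sqlite3.connect("zotero-copy.sqlite")
    cursor = connection.cursor()

    # Roughly what ".schema" lists in the sqlite3 shell: the tables themselves.
    cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    print([row[0] for row in cursor.fetchall()])

    # The tags you have already created, as in "select * from tags".
    cursor.execute("SELECT * FROM tags")
    for row in cursor.fetchall():
        print(row)

    connection.close()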

Notes for the Zotero team. Don't let my suggestions for future improvements distract you from the bottom line: Zotero is already an amazing piece of software that could change the way we do history. The visionary team at CHNM should really be congratulated for making this, and making it open source. But since it is a beta release...
  • There is a time stamp when records are added to the database or modified, which is great. The same information isn't readily available, however, when one takes a snapshot of a webpage.
  • Library of Congress scraper: it would be good to automatically harvest links such as URLs. Future releases (or hacks) could build on this by including the ability to spider within Zotero.
  • WorldCat scraper: it should grab the OCLC number and put it into the call number field. Again, this is crucial for automated spidering.
  • Geocoding: please, please add fields for latitude and longitude. Geocoded sources are worth having, and I want to mash them up with the Google Maps API and MapServer.
  • Record types: at the very least, we need a generic record type to hold odds and ends. Ideally there would also be new types for blogs, blog posts, archival fonds, generic material objects and audio CDs.
  • Tags: when adding tags, there should be a lookup table so you can select one that you've already used (good for consistency).
  • Tags: nice to have a way of browsing by tag (as in del.icio.us), probably over in the lefthand panel.
  • RSS feeds: it would be awesome if the functionality of Sage was built into the lefthand pane. Many of the sources I want to cite these days are blog posts.
