Saturday, April 29, 2006

SIP Mapping

In an earlier post, I mentioned the fact that Amazon keeps track of phrases that are distinctive to a small set of books. These SIPs (statistically improbable phrases) can be used to get some idea of the conceptual landscape in and around particular works, and thus can be used to generate bibliographies. Ideally, of course, the process could be automated. If machine-readable versions of the books were available, it could also be used as part of a text mining project.

I haven't had a chance to do much programming recently, so I thought I would put together a rudimentary hack to scrape SIPs and create a map. I also wanted to learn how to use the open-source Graphviz visualization toolkit, so I used a Perl module to link to it. If you look at the code for the hack, you can see how simple it is to create pretty neat graphs. The figure below (1 MB) shows what happens when you start with Diamond's Guns, Germs, and Steel and follow the SIPs to adjacent books. The figure is more than 8,000 pixels wide, so you have to zoom in to see the detail ... and at that level it is pretty complicated. I will leave the implementation of a better graph browser for a future hack.
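For readers who want a feel for the graph-building step, here is a minimal sketch in Python. It is not the Perl hack described above, and it skips the scraping entirely: it assumes you already have a table of books and their SIPs (the function name `sip_graph_to_dot`, the helper `dot_quote`, and the sample titles and phrases are all made up for illustration). It links any two books that share a phrase and writes the result in Graphviz's DOT language.

```python
from collections import defaultdict


def dot_quote(text):
    """Wrap text in double quotes, escaping characters DOT treats specially."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def sip_graph_to_dot(book_sips):
    """Return DOT text for an undirected graph linking books that share a SIP.

    book_sips maps a book title to a list of its phrases. This is
    hypothetical input; real data would have to be collected separately.
    """
    # Invert the mapping: phrase -> set of books that contain it.
    books_by_sip = defaultdict(set)
    for book, sips in book_sips.items():
        for sip in sips:
            books_by_sip[sip].add(book)

    lines = ["graph sips {", "  node [shape=box];"]
    for sip in sorted(books_by_sip):
        books = sorted(books_by_sip[sip])
        # One edge per pair of books sharing the phrase, labeled with it.
        for i, first in enumerate(books):
            for second in books[i + 1:]:
                lines.append(
                    f"  {dot_quote(first)} -- {dot_quote(second)}"
                    f" [label={dot_quote(sip)}];"
                )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data, not real SIPs from any real book.
    sample = {
        "Book A": ["phrase one", "phrase two"],
        "Book B": ["phrase one"],
        "Book C": ["phrase two", "phrase three"],
    }
    print(sip_graph_to_dot(sample))
```

If Graphviz is installed, saving the output as `sips.dot` and running `dot -Tpng sips.dot -o sips.png` should render it as an image. Phrases that appear in only one book produce no edges, which keeps the map from filling up with dead ends.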




Tuesday, April 25, 2006

My Folders at the McCord

I've just returned from a history education workshop in Vancouver, where I met Marie-Claude Larouche, co-ordinator of the online education program at the Musée McCord in Montreal. The McCord has digitized a large number of sources already, including more than 120,000 images, and has created a number of innovative online displays. One of my favorites is "Urban Life through Two Lenses" which uses some clever Flash programming to allow the user to superpose contemporary and historical photographs in interesting ways. The McCord has also contracted with historians to write a number of short tours of the collection, highlighting major events and aspects of everyday life.

So much is to be expected from a savvy museum in the 21st century. One of the innovative things the McCord has done, however, is to allow users to create their own tours through a My Folders mechanism. Anyone can create an account, select digital images from the McCord's collection, and use them to support a historical narrative. As an example of the potential of this, see "A Vile Style," a narrative created by Christy Yau, a student in Tom Morton's grade 10 history class at David Thompson High School in Vancouver. Using corseting as an example, Yau asks the question "Is fashion worth dying for?" She shows that nineteenth-century history can be relevant to present concerns. Her argument is well supported by her pictorial sources and charmingly written, e.g., "As one should know, the human body was never meant to be compressed to the point of deformity for the sake of fashion, or anything else, for that matter."

Tours written by students and members of the public are stored on the McCord's server and made available, with a disclaimer, on the web. This kind of mechanism has a number of important implications for the practice of history. For one thing, it greatly reduces the information costs associated with using a distant archive. It is easier for students at a high school in Vancouver to use the digital resources in Montreal than it would be for them to use the material resources in their own city. The McCord is also building a resource which can be used by scholars of history education and/or public history. How do people construct historical narratives from visual sources? What kinds of inferences do they think are supported or warranted by what kinds of sources? How do their understandings of particular sources differ from the interpretations of professional historians?


Sunday, April 09, 2006

Information Costs

The basic idea of an information cost is pretty simple: it costs something to learn something. We all know that books cost money, reading takes time, universities charge tuition, archival work and fieldwork are expensive, file folders need to be stored, computers need to be replaced (frequently) and people are forgetful. Once you start to take information costs into account, however, there are surprising consequences for economic history, property rights, law, and many other fields (see, for example, the work of Douglass North, Yoram Barzel and Ronald Coase).

We are at a point where it is possible to imagine that nearly all historical sources could become digital and readily accessible over the next few decades. This means that the relative cost of accessing any particular source will be near zero, and the practice of history will be completely transformed as a result.

Past historical projects were largely shaped by information costs, although not explicitly framed in those terms. It was easier to read through the contents of one archival box than to go through a number of different boxes: typically, each box had to be requested, retrieved from storage, stored in the reading room while someone was looking at it, and then returned to storage. For practical reasons, archives limited the number of boxes that could be requested at a time, and often took a substantial amount of time to process each request. It was easier to use the resources of a single archive than a number of archives. The costs of access were multiplied by travel between archives and by the need to learn the ropes at each one. Furthermore, much of the material in archives was not indexed in finding aids, and it was even more difficult to search effectively across archives.

As archives digitize their holdings, historians can no longer expect to face these costs. At the moment, it is much easier for me to examine the 80,000 historical photographs online at the BC Archives in Victoria, BC (3,285 kilometres away) than it is to study historical photographs in the regional collection of my own university library. Eventually, these kinds of discrepancies will vanish. In the meantime, however, historians are confronted with an unfamiliar and counterintuitive set of information costs as they approach new projects, or advise students beginning research.

In the long run, the complete digitization of our archival base may be accompanied by the emergence of a separate field of historical informatics. The current situation in biology is instructive. At first glance, the stuff of biology (genes, cells, organisms, ecosystems, and so on) would seem to have little to do with information processing. The past few decades, however, have seen the emergence of bioinformatics, an explicitly computational form of biology. Students in many areas of the life sciences now find that they need a basic understanding of statistics, applied math, and programming. These are precisely the kinds of things that young historians need to start learning now.

In a sense, the information-processing revolution in history is one part of a much larger and longer-term trend that J. R. and William H. McNeill have traced in The Human Web. Patterns of interaction and exchange have become ever denser and faster over the course of the Holocene, with a consequent reduction in information costs.


Wednesday, April 05, 2006

Methodology for the Infinite Archive

In a widely circulated talk, the mathematician Richard Hamming suggested that researchers should ask themselves what are the most important problems in their field, and, as a follow-up question, why they are not working on them [see Hamming, You and Your Research, and for an interesting response, Graham, Good and Bad Procrastination]. Historians seem comfortable enough asking about the significance of a particular piece of current work—most directly with some form of the "so what?" question—but we seem less willing (or perhaps able) to enter into discussions about the relative importance of current approaches or schools. Perhaps it is the historiographic tradition: with the benefit of hindsight we can all agree on the importance of the Annales school. For more recent approaches, like big history, most historians seem to be taking a wait-and-see approach.

Now I have to admit that I have reservations about big history, some of which I've spelled out in a forthcoming article in the journal Rethinking History. Setting those reservations aside, however, I think that the project that the big historians are attempting is an important one. They are trying to put the past, from the Big Bang to the present day, into a single, coherent narrative. Such an ambitious project is bound to fail in some ways, of course, but the failures promise to be interesting and informative. What more could we want? We know that the next generation will revise our interpretations ... let's give them something that is worth revising. [For an introduction to big history, see David Christian's wonderfully readable Maps of Time.]

It probably won't come as much of a surprise that I think that the questions raised by digital history are some of the most important that we face. The explosion of printed material after the fifteenth century fundamentally changed scholarship, making it much easier to compare different editions of the same text, making it possible to read extensively as well as intensively, and creating the conditions for widespread literacy [see, for example, the essays in The Renaissance Computer]. We are currently in the midst of another such transformation, one that will give us nearly instantaneous access to the contents of the world's great libraries and archives, will radically democratize knowledge production, and will force us to think of machines as part of our audience.

So does this mean that we have to throw out everything we hold dear? Of course not. There's still no substitute for being able to read closely and critically; as Timothy Burke put it a few months ago in Easily Distracted, "interpretation is the antibody" against viral marketing and other kinds of spin and propaganda. Given the low average quality of online information and the read/write nature of the web, we need the work of archivists, librarians and curators more than ever.

We also need some new skills. We need to be able to digitize and digitally archive existing sources; to create useful metadata; to find and interpret sources that were "born digital"; to expose repositories through APIs; to write programs that search, spider, scrape and mine; to create bots, agents and mechanical turks that interact seamlessly with one another and with human analysts.

I've noticed that many people, otherwise very erudite, feel comfortable coming up to me and saying, "I'm a luddite," like it was something to be proud of. So how well did that turn out the first time? Don't we study history, in part, so we don't have to repeat it? [See Thompson's Making of the English Working Class for more on the Luddites.]