Wednesday, June 27, 2007
Last spring I posted a short piece about Rudi Cilibrasi and Paul Vitányi's use of compression as a universal method for clustering ["Clustering by Compression," IEEE Transactions on Information Theory 51, no. 4 (2005): 1523-45, PDF]. The basic idea is that a compression algorithm makes a string shorter by keeping track of redundancies and eliminating them. Suppose you have two strings, x and y. If there is some overlap between them, then compressing the concatenation xy should produce a smaller result than compressing x and y separately and adding the two lengths. (There are more details in my earlier post.) Cilibrasi and Vitányi formalized this idea as the Normalized Compression Distance (NCD).
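In symbols: if C(s) is the length of string s after compression, then NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, which runs from near 0 for highly similar strings to around 1 for unrelated ones. Here is a minimal sketch in Python, using the built-in zlib where the paper experiments with stronger compressors; any real compressor can stand in for C:

```python
import zlib

def C(s):
    """Compressed length of a byte string s."""
    return len(zlib.compress(s))

def ncd(x, y):
    """Normalized Compression Distance between byte strings x and y."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)
```

One caveat with zlib: its 32 KB window limits how far back it can spot redundancy, so for very long documents a compressor like bzip2 may give more meaningful distances.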
In my earlier post I selected a handful of entries from the Dictionary of Canadian Biography and submitted them to an open source clustering program that Cilibrasi and Vitányi had provided. I chose the people that I did because I already had some idea of how I would group them myself. The results of the automated clustering were very encouraging, but I hadn't had a chance to follow up with compression-based clustering until now.
This time around, I decided to do a more extensive test. I wrote a Python program to randomly select 100 biographies from volume 1 of the DCB and compute the NCD between each pair. A second Python program combed through the output file of the first to automatically create a Graphviz script that plots all connections below a user-specified threshold. The resulting graph, for NCDs < 0.77, is shown below.
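I won't reproduce my scripts here, but something like the following sketch captures the shape of both steps in one; the bios/ directory of one-biography-per-file plain text, and the collapsed two-in-one structure, are simplifications for the example:

```python
import itertools
import zlib
from pathlib import Path

def C(s):
    return len(zlib.compress(s))

def ncd(x, y):
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

# Assumed layout: one biography per plain-text file in bios/.
bios = {p.stem: p.read_bytes() for p in Path("bios").glob("*.txt")}

THRESHOLD = 0.77

# Emit a Graphviz graph with an edge for every pair below the threshold.
print("graph dcb {")
for a, b in itertools.combinations(sorted(bios), 2):
    d = ncd(bios[a], bios[b])
    if d < THRESHOLD:
        print(f'  "{a}" -- "{b}" [label="{d:.2f}"];')
print("}")
```

Piping the output through a spring-layout tool like Graphviz's neato produces a figure like the one below.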
Looking at different parts of the figure in turn makes clear that this is a remarkably powerful technique, especially considering that the code is very simple and the algorithm has no domain-specific knowledge whatsoever. It knows nothing about history or the English language, and yet it is able to find connections among biographical entries that are meaningful to a human interpreter.
One isolated chain of biographies consists of people active in Hudson Bay in the late seventeenth century.
A second isolated cluster consists mostly of Englishmen who settled on the coasts of Newfoundland in the early 1600s.
The main cluster is composed almost entirely of people based in various parts of Quebec in the seventeenth century. There is an arm of Acadian settlers, some of whom spent time in Quebec.
There is a somewhat puzzling arm that I haven't really figured out.
And there is the main body of the network and a downward projection, both composed entirely of seventeenth-century Québécois.
One benefit of this method is the speed with which it executes. It took only seconds to calculate the 4,950 NCDs for my 100-biography sample; clustering the entire DCB would require computing 35,427,153 distance measures and would take less than a day to run on my inexpensive home computer. I'll save that hack for another time.
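Both counts are just numbers of unordered pairs, n(n-1)/2; working backwards, the DCB figure corresponds to a corpus of 8,418 biographies:

```python
# Pairwise distance counts: n items require n * (n - 1) // 2 comparisons.
def pairs(n):
    return n * (n - 1) // 2

print(pairs(100))   # 4950
print(pairs(8418))  # 35427153
```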
Tags: data compression | data mining | Dictionary of Canadian Biography | digital history | Kolmogorov complexity | visualization
Tuesday, June 19, 2007
Hope, the New Research Strategy
In February of this year, I realized that I needed to read an article from an obscure but relatively important scientific journal published in 1797. Following the advice that Thomas Mann gives in his excellent Oxford Guide to Library Research, I started by assuming that what I was looking for must exist. I spent a couple of hours looking for an online copy, using many of the techniques that I've described in this blog, and a few I haven't gotten around to describing. No luck.
At that point, of course, I might have found a copy in one of the libraries or archives in the greater Toronto area and gone there to read it. Instead, I decided to try something completely different. I made a note of what I wanted and recorded the details of my inconclusive search. Then I said to myself, "Boy, I sure hope that Google digitizes the journal soon and makes a full copy available on Google Books." After that there was nothing left to do but wait.
Today I realized, once again, that I really should read that article. When I typed the journal title into Google Books, there it was, waiting for me to download. Now, I don't recommend this strategy to anyone working on, say, their dissertation. Maybe it was a complete fluke. I suspect, however, that it is a weak measure of the seismic shifts that our landscape of information costs is currently undergoing.
Tags: Google | information costs
Monday, June 18, 2007
Seeing There
In his 1991 book Mirror Worlds, David Gelernter argued that we should be trying to engineer systems that help us "to achieve something that is so universally important and yet so hard to come by that it doesn't even have a word to describe it." Gelernter called it topsight. "If insight is the illumination to be achieved by penetrating inner depths," he wrote, "topsight is what comes from a far-overhead vantagepoint, from a bird's eye view that reveals the whole--the big picture; how the parts fit together. ('Overview' comes fairly close to what I mean. But an 'overview' is something you either have or you don't. Topsight is something that, like insight, you pursue avidly and continuously, and achieve gradually)" (pp. 52-53). One route to topsight that Gelernter proposed was to build microcosmic "mirror worlds" that would faithfully reflect live data collected continuously from the world, and summarized to create levels of abstraction.
The Mirror World is directly accessible, twenty-four hours a day, to the population that it tracks. You can parachute in your own software agents. They look out for your interests, or gather data that you need, or let you know when something significant seems to be going on. You consult the Mirror World like an encyclopedia when you need information; you read it like a dashboard when you need a fast take on current status (p. 6).
When I first read Mirror Worlds sixteen years ago, this seemed like science fiction, a beautiful dream. Now it seems like business as usual. Using my GPS-enabled phone and Google Maps mobile I'm able to see an aerial view of my position and route. I like to walk whenever I can, and often study radar imagery to look for gaps in rain or snowfall. If I'm stuck on the highway, I can use the phone to look ahead through a series of traffic cameras. Google has recently upped the ante by adding an incredible amount of data to Google Earth and by rolling out their new Street View for select cities. It's become natural to treat the internet as, in William J. Mitchell's colorful phrase, "a worldwide, time-zone-spanning optic nerve with electronic eyeballs at its endpoints" (City of Bits, 31).
The widespread digitization of historical sources raises the question of what kinds of top-level views we can have into the past. Obviously it's possible to visit an archive in real life or in Second Life, and easy to imagine locating the archive in Google Earth. It is also possible to geocode sources, linking each to the places to which it relates or refers. Some of this will be done manually and accurately, some automatically with a lower degree of accuracy. Augmenting places with sources, however, raises new questions about selectivity. Without some way of filtering or making sense of these place-based records, what we'll end up with at best will be an overview, and not topsight.
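To make the geocoding idea concrete, here is a deliberately naive sketch; the gazetteer, its rounded coordinates, and the sample sentence are all invented for illustration, and real automatic geocoding would need named-entity recognition and disambiguation, which is where the lower accuracy creeps in:

```python
# Toy gazetteer: place names mapped to approximate coordinates.
GAZETTEER = {
    "Quebec": (46.81, -71.21),
    "Hudson Bay": (60.00, -85.00),
    "Newfoundland": (48.95, -55.67),
}

def geocode(text):
    """Return (place, lat, lon) for each gazetteer place mentioned in text."""
    return [(place, lat, lon)
            for place, (lat, lon) in GAZETTEER.items()
            if place.lower() in text.lower()]

# An invented sample source; prints Quebec and Hudson Bay with coordinates.
source = "He wintered at Hudson Bay before returning to Quebec in 1697."
for place, lat, lon in geocode(source):
    print(f"{place}: {lat}, {lon}")
```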
Tags: geocoding | microcosm | place | topsight
Saturday, June 09, 2007
History Appliances: The Soundscape
I recently had a chance to visit my friends at the Center for History and New Media and give a brown bag talk about history appliances. I outlined my optimistic version of the idea and Rob MacDougall's more pessimistic (and probably more realistic) version. For people who want to get started building their own history appliances, I discussed some of the wetware, software and hardware that might be useful. While I was there, I also chatted with Dan, Mills and Tom for an episode in their excellent Digital Campus podcast series. Unfortunately I had a bad cold, so even their audio tech expert couldn't make me sound more like Barry White than Barry Gibb.
The best part of the brown bag talk was that I was able to break in the middle for a brainstorming session with the CHNM audience, where they came up with a bunch of great new ideas. Paula, for example, suggested that it might be possible to create a tunable soundscape. Set the dial for 1873 and you might hear the sounds of horse hooves and wagon wheels on cobblestones, church bells or the cries of street vendors. Such an appliance would be truly ambient, and could be based on the kinds of work being done by historians like Emily Thompson or Richard Cullen Rath.
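As a first pass at what "tunable" might mean in software, here is a toy sketch: a lookup from a dial year to the ambient layers active in that period. The layer names and date ranges are invented for the example; a real appliance might use this kind of table to drive a sampler or a Max/MSP patch:

```python
# Toy "tunable soundscape": each layer is (start year, end year, sound).
SOUND_LAYERS = [
    (1800, 1920, "horse hooves on cobblestones"),
    (1800, 1920, "wagon wheels"),
    (1600, 2000, "church bells"),
    (1700, 1930, "street vendors' cries"),
    (1890, 2000, "streetcars"),
]

def soundscape(year):
    """Return the layers that should be playing for a given dial setting."""
    return [name for start, end, name in SOUND_LAYERS if start <= year <= end]

print(soundscape(1873))
# ['horse hooves on cobblestones', 'wagon wheels', 'church bells',
#  "street vendors' cries"]
```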
Our discussion of the tunable soundscape quickly veered into questions of uncertainty (would we really know what the component sounds were?), veracity (how would we know if we got it right?) and lived experience (what did it sound like to the people of the time?). In the limit we would be faced with the problem raised by the philosopher Thomas Nagel in his famous essay "What is it like to be a bat?" Sure, we might make a machine that could convey exactly what it would be like for one of us to be a bat, but it still wouldn't tell us what it is like for a bat to be a bat.
For me that points to one of the key benefits of the history appliance idea. We should approach the building of history appliances as a form of critical, reflective practice ... we make things, we design interactions, to give us new routes into the questions that historians have always struggled with.
Tags: ambience | history appliances | lived experience | Max/MSP/Jitter | phidgets