Saturday, May 24, 2008

A Naive Bayesian in the Old Bailey, Part 1

One of the great benefits of having a blog has been that people who are interested in digital history find me and let me know what they are doing in the field. For a couple of years now, I've enjoyed an intermittent but invariably thought-provoking correspondence with Tim Hitchcock, one of the creators of the wonderful digital archive of the Old Bailey proceedings. The OB team has recently added records for the period from 1834 to 1913, resulting in a total of almost 200,000 trial records, all tagged with XML. When Tim offered me access to the XML files for a data mining project a few months ago, I jumped at the chance. This is still very much work in progress, but I've decided to blog about the process for others who are interested in doing similar things, whether with the Old Bailey archive or some other.

I started by downloading local copies of all of the files. This is usually a good idea both because it makes the processing faster and because you aren't hammering the archive's servers every time you need to access a record. There are a number of different ways to do something like this, and it is very handy for historians to be familiar with at least some of them. One possibility is to use a Firefox extension like DownThemAll. This allows you to download all of the links or images in a webpage. It also allows you to pause and resume the download process, which can be useful when you're working with a large number of files. For those who are more comfortable with scripting and prefer command line tools, it is hard to beat GNU Wget. Both programs are free. The third alternative is to write your own script in a language like Python or Perl. This option is the most difficult, but it gives you more control over various kinds of preprocessing, like dealing with accented characters. (For more, see the section on this in The Programming Historian.) It takes a while to download a large batch of files, but once you have them you're ready to move on to the next step.
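If you go the scripting route, the whole job boils down to looping over a list of record identifiers, fetching each file, and writing it to disk. Here is a minimal sketch in Python. The URL pattern and record identifiers are made up for illustration (every archive has its own naming scheme), and the one-second pause is just a polite default.

    import os
    import time
    import urllib.request

    base_url = 'http://www.example.org/sessions/'    # hypothetical URL pattern
    record_ids = ['t18340102-1', 't18340102-2']      # hypothetical record identifiers
    os.makedirs('oldbailey', exist_ok=True)

    for record_id in record_ids:
        out_path = os.path.join('oldbailey', record_id + '.xml')
        if os.path.exists(out_path):                 # lets you pause and resume: skip what you already have
            continue
        with urllib.request.urlopen(base_url + record_id + '.xml') as response:
            data = response.read()
        with open(out_path, 'wb') as f:
            f.write(data)
        time.sleep(1)                                # be polite: pause between requests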


Saturday, May 17, 2008

Geo-DJ, Part 3: The Simplest Working Version

Some people may have the ability to come up with something awesome on their first pass--say, Athena springing from the forehead of Zeus fully formed--but I've learned that I have to make some mistakes along the way. So I try to come up with the simplest working version of a project, then complexify it gradually. Of course, things being what they are, you can usually improve something by simplifying it, so the first, apparently simplest, version is actually somewhere in the middle of the scale from perfect to perfectly foobar.

With the geo-DJ, I imagine the simplest working version to be something like a metal detector for historical landscape features. Suppose you know that there used to be an electric streetcar running through the middle of downtown, but most material traces of it have since been torn up. If you have a map of the streetcar route, you can use existing landmarks to georeference it, and determine the latitude and longitude of the endpoints (and any additional inflection points, but let's ignore those and work with a purely linear feature). The locations of the endpoints need to be stored in memory.

As the user walks around, the geo-DJ loops through the following algorithm. First, determine the user's current position. Then, determine the line through the endpoints (the former rail), compute the perpendicular distance from the user to that line (i.e., the length of the shortest vector from the user's position to the rail), and use that distance to scale the pitch of a tone playing in the headphones. Repeat, ad infinitum... or until the batteries drain, whichever comes first. If the user steps toward the rail, the pitch of the sound increases. If he or she steps away from it, the pitch decreases. Using this version of the system, a person can explore the lineaments of landscape features which may no longer exist. See Michal Migurski's great air photo of San Francisco "healing" around a former railroad.
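Here is a rough sketch of one pass through that loop, written in Python rather than whatever will eventually run on the device. It treats the coordinates as planar x/y values in metres (a reasonable simplification over a few city blocks), and the endpoint coordinates, position fix, maximum distance and frequency range are all made-up values for illustration.

    import math

    def distance_to_line(px, py, ax, ay, bx, by):
        # Perpendicular distance from point (px, py) to the line through
        # (ax, ay) and (bx, by): |cross(B - A, P - A)| / |B - A|.
        cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
        return abs(cross) / math.hypot(bx - ax, by - ay)

    def pitch_for_distance(d, max_distance=500.0, low_hz=220.0, high_hz=880.0):
        # Map distance to pitch: stepping toward the rail raises the tone.
        d = min(d, max_distance)
        return high_hz - (high_hz - low_hz) * (d / max_distance)

    # Endpoints of the former streetcar line on a local planar grid (made-up
    # values), and one made-up position fix for the user.
    ax, ay = 0.0, 0.0
    bx, by = 1200.0, 300.0
    px, py = 400.0, 250.0

    d = distance_to_line(px, py, ax, ay, bx, by)
    print(round(d, 1), 'metres from the rail ->', round(pitch_for_distance(d), 1), 'Hz')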


Friday, May 16, 2008

Geo-DJ, Part 2: Storage vs. Computation

In my last post, I mentioned that I'm working with a couple of talented students this summer on digital history projects, and talked a bit about Adam Crymble's Zotero translators. The other person who is working with me is Devon Elliott. Last year Devon came up with a plan to use wikis in archives and built a model of Sputnik that contained a microcontroller, a thermistor to sense temperature changes and an accelerometer to respond to motion. The information about the model's state was conveyed by modulating the frequency and duration of a beeping signal. Devon did the programming and electronics without any help from me, so I knew he would be the perfect collaborator for the geo-DJ project.

The geo-DJ is a wearable iPod-like device. As you wander around a present-day environment, it uses GPS to determine your position and synthesizes an electronic soundtrack that reflects former land-use patterns. Creating something like this wouldn't be too difficult using a lightweight laptop or a powerful handheld computer running GIS software. But we're interested in doing the project at as low a level as possible, preferably using an open source microcontroller board like Arduino.

In the history of computing, people often faced the limits of both memory capacity and processing speed. Consider the problem of determining trigonometric functions for particular values. There are algorithms for computing the sine of an angle, but they're complicated. Before the widespread adoption of digital calculators, it was common for people to use trig tables, a clear case of using more storage space to simplify or speed up calculation. With digital calculators or general-purpose computers, it is simpler and faster to punch in the calculation than to look it up in a trig table. But here is the tricky part: it may not be simpler for the computer to do the computation. The software may involve looking up the value of various trig functions in tables, even though that is not apparent to the user.
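To make the trade-off concrete, here is a toy illustration in Python (nothing to do with the geo-DJ code itself): a sine table precomputed for every whole degree versus a sine computed on demand from its Taylor series. The table costs a few hundred stored values; the series costs arithmetic every time it is called.

    import math

    # Precompute sine for every whole degree: a few hundred stored floats
    # buy us a table lookup instead of arithmetic at run time.
    SINE_TABLE = [math.sin(math.radians(deg)) for deg in range(360)]

    def table_sin(degrees):
        # Storage-heavy approach: look up the nearest whole degree.
        return SINE_TABLE[int(round(degrees)) % 360]

    def series_sin(radians, terms=10):
        # Computation-heavy approach: sum the Taylor series each time.
        return sum((-1) ** n * radians ** (2 * n + 1) / math.factorial(2 * n + 1)
                   for n in range(terms))

    print(table_sin(30))                   # fast, coarse
    print(series_sin(math.radians(30)))    # slower, as precise as the number of terms allows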

Doing the geo-DJ project on a small computer like Arduino approaches these limits in (at least) two places: GIS and music synthesis. In the case of the GIS, we want to know the person's distance from the various points, lines and polygons that are used to represent historical features of interest. There are algorithms for computing these measures, but our processor is slow and our application requires real-time feedback. It might make more sense to pre-compute the measures and store the information about distances in a multi-dimensional array. Of course, the basic amount of memory on an Arduino is also very limited, so we have to find the optimal balance. In the case of music synthesis, a similar problem arises. Sounds have complicated waveforms which can be computed or looked up in a wave table. Once again, we will have to find the right balance between storage and computation.
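As a sketch of the pre-computation idea (in Python for clarity; on the Arduino itself the table would be generated offline and stored as a constant array), suppose the historical features are reduced to a handful of points on a local grid. The feature coordinates, cell size and grid dimensions below are made up for illustration.

    import math

    # Hypothetical historical features, as points on a local grid in metres.
    features = [(120.0, 80.0), (400.0, 350.0), (640.0, 90.0)]

    CELL = 50.0              # cell size in metres: coarser cells mean less storage but less precision
    WIDTH, HEIGHT = 16, 10   # number of cells covering the study area

    def nearest_feature_distance(x, y):
        return min(math.hypot(x - fx, y - fy) for fx, fy in features)

    # Precompute once, offline: this table is what would live in the device's memory.
    distance_grid = [[nearest_feature_distance((i + 0.5) * CELL, (j + 0.5) * CELL)
                      for i in range(WIDTH)]
                     for j in range(HEIGHT)]

    # At run time the "GIS" is just array indexing: no square roots on the slow processor.
    def lookup_distance(x, y):
        i = min(int(x // CELL), WIDTH - 1)
        j = min(int(y // CELL), HEIGHT - 1)
        return distance_grid[j][i]

    print(round(lookup_distance(130.0, 95.0), 1))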

It may be that the platform that we're trying to use is too simple. We may have to add more memory, or dedicated signal processing hardware, or both. But that is one of the things that makes a project like this fun. By working close to computational limits we not only have more of a challenge, but more of a sense of what computing used to be like, long ago, when we were kids.


Saturday, May 10, 2008

Beginning in the Middle

For the past few summers, I've been taking on talented students to work on digital stuff. Rather than giving them a canned project or expecting anything in particular to happen, I usually give them a difficult problem and then step back. The results have been very encouraging, especially since I tend to choose independent students who are OK with my laissez-faire approach.

One of the people who is working with me this summer is Adam Crymble. Last year he managed to come up with a low-tech public history hack, make some 3D animations, and teach himself enough HTML and CSS to hand code a web page. So for a summer project I suggested he try and write some translators for Zotero. He doesn't have any training for this, and I am of limited assistance since I don't really know JavaScript. Sink or swim, buddy!

Adam intuitively started where I would. He printed out all the code and documentation that he could get his hands on, then started using colored highlighters to focus his attention on the parts that he could understand. He also used Wikipedia, the W3 Schools, and our library's Safari subscription to O'Reilly books online. In the space of a couple of weeks, he's made great progress and learned enough so that I'm still of no use to him.

Reading other people's code is always hard, but it is one of the best ways to learn how to program. As Abelson and Sussman write in Structure and Interpretation of Computer Programs, "a computer language is not just a way of getting a computer to perform operations but rather ... a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute." The beginning programmer starts out much like a child who is acquiring a natural language: immersed in a medium produced by people who are already fluent.

Historians have a secret advantage when it comes to learning technical material like programming: we are already used to doing close readings of documents that are confusing, ambiguous, incomplete or inconsistent. We all sit down to our primary sources with the sense that we will understand them, even if we're going to be confused for a while. This approach allows us to eventually produce learned books about subjects far from our own experience or training.

I believe in eating my own dogfood, and wouldn't subject my students to anything I wouldn't take on myself. As my own research and teaching moves more toward desktop fabrication, I've been reading a lot about materials science, structural engineering, machining, CNC and other subjects for which I have absolutely no preparation. It's pretty confusing, of course, but each day it all seems a little more clear. I've also been making a lot of mistakes as I try to make things. I don't think we humanists can do better than to follow Terence's adage that nothing human should be alien to us. It is possible to learn anything, if you're willing to begin in the middle.


Sunday, May 04, 2008

The Programming Historian is Now Available

The Programming Historian is now available on the NiCHE: Network in Canadian History & Environment website. This work is an open-access introduction to programming in Python, aimed at working historians (and other humanists) with little previous experience. Introductory lessons teach you how to
  • install Zotero, the Python programming language and other useful tools
  • read and write data files
  • save web pages and automatically extract information from them
  • count word frequencies
  • remove stop words
  • automatically refine searches
  • make n-gram dictionaries
  • create keyword-in-context (KWIC) displays
  • make tag clouds, and
  • harvest sets of hyperlinks
The Programming Historian is a work-in-progress. We are constantly adding new material, much of it driven by reader request. Upcoming topics will include indexing, scraping projects, simple spiders, mashups and much more.
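To give a sense of the flavour of those lessons, here is a quick sketch of counting word frequencies and removing stop words in Python. This snippet is not taken from the book itself, and the sample text and stop word list are made up for illustration.

    from collections import Counter

    # A made-up snippet of trial text and a tiny stop word list, just to
    # illustrate the word-frequency and stop-word tasks.
    text = ("The prisoner was indicted for stealing a watch. The watch was "
            "found in the prisoner's lodging.")
    stop_words = {'the', 'a', 'was', 'for', 'in', 'of', 'and'}

    words = [w.strip('.,;:\'"').lower() for w in text.split()]
    frequencies = Counter(w for w in words if w and w not in stop_words)

    for word, count in frequencies.most_common(5):
        print(word, count)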
