Thursday, June 05, 2008

A Naive Bayesian in the Old Bailey, Part 2

After downloading the XML-tagged files for the nineteenth century to our local machine, we ended up with a directory tree that looks like this:
  • Tagged_final (944 files, 9 folders)
    • Tagged_1830s_Files (62 files)
      • T18341124NW_SUP_DONE.xml
      • T18341205PH.xml
      • ...
      • T18391216CLR.xml
    • Tagged_1840s_Files (120 files)
    • ...
    • Tagged_1910s_Files (41 files)
      • T19100111GS_SUP_DONE.xml
      • T19100208GS.xml
      • ...
      • T19130401CLR.xml
Each XML file contains all of the trials that were conducted in a particular session. The eight digits after the initial 'T' in the filename give the session date as year, month and day, so the file 'T18341124NW_SUP_DONE.xml', for example, is the record for 24 November 1834. I'm assuming that the string that follows the date in the filename ('NW_SUP_DONE') refers to the encoding process, so I'm going to ignore it.
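To make the filename convention concrete, here is a minimal sketch that pulls the session date out of a filename and ignores whatever follows it. The pattern (a 'T' followed by eight digits) is inferred from the filenames listed above, and the function name session_date is just a label chosen for this example.

```python
import re
from datetime import date

# Assumed pattern: 'T', then eight digits (YYYYMMDD), then a suffix we ignore.
SESSION_NAME = re.compile(r"^T(\d{4})(\d{2})(\d{2})")

def session_date(filename):
    """Return the session date encoded in a filename, or None if it doesn't match."""
    match = SESSION_NAME.match(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    # date() raises ValueError if the digits are not a real calendar date.
    return date(year, month, day)

print(session_date("T18341124NW_SUP_DONE.xml"))  # 1834-11-24
```

Anything after the eight digits, such as 'NW_SUP_DONE' or 'GS', is simply never looked at.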

The next step is to split each of these XML files into individual trials. Our overall strategy will be as follows. First we want to create a directory for the trial files if one doesn't already exist. Then we will get a list of all of the XML files for the decade and step through them one at a time. For each XML file, we're going to extract each trial and save it as a separate file. Since a given trial is delimited with tags that look like <trial id="t-18341124-1" n="1"> ... </trial>, we can parse it out and save it separately as 't-18341124-1.txt'. You can read this trial online at the Old Bailey archives. You can also have a look at the XML file to see what we're dealing with. The fact that the OB team provides XML makes this archive an awesome resource for digital historians, and other online sites should do the same forthwith.
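The strategy just described can be sketched in a few lines. This is only an illustration, not the program linked from this post: it uses a regular expression from the Python standard library as a stand-in for a real XML parser, and it assumes that trials are never nested, that every opening tag carries an id attribute like the one shown above, and that the files can be read as UTF-8. The names split_session and split_decade are made up for the example.

```python
import re
from pathlib import Path

# Assumption: trials are not nested and each <trial> tag has an id attribute
# shaped like the example above, e.g. <trial id="t-18341124-1" n="1">.
TRIAL = re.compile(r'<trial\b[^>]*\bid="([^"]+)"[^>]*>.*?</trial>', re.DOTALL)

def split_session(xml_text):
    """Return a list of (trial_id, trial_markup) pairs found in one session file."""
    return [(m.group(1), m.group(0)) for m in TRIAL.finditer(xml_text)]

def split_decade(decade_dir, trial_dir):
    """Write every trial found in decade_dir/*.xml to trial_dir/<trial id>.txt."""
    decade_dir, trial_dir = Path(decade_dir), Path(trial_dir)
    # Create the output directory if it doesn't already exist.
    trial_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    # Step through the session files one at a time.
    for xml_path in sorted(decade_dir.glob("*.xml")):
        # Assumed encoding; undecodable bytes are replaced rather than raising.
        text = xml_path.read_text(encoding="utf-8", errors="replace")
        for trial_id, markup in split_session(text):
            (trial_dir / f"{trial_id}.txt").write_text(markup, encoding="utf-8")
            count += 1
    return count
```

Calling something like split_decade("Tagged_final/Tagged_1830s_Files", "Trials_1830s") would then write one file per trial and return how many were written; the output directory name here is hypothetical.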

There are various ways to parse XML, but the Beautiful Soup library for Python makes it quick and easy. The program that splits the XML files into separate trial files is here; for more information about using Beautiful Soup see The Programming Historian. There are far more trial files than session files: the 1830s alone account for 12,959 trials. Now that we have one file for each trial, we're ready for the next step.
