- Tagged_final (944 files, 9 folders)
- Tagged_1830s_Files (62 files)
- Tagged_1840s_Files (120 files)
- Tagged_1910s_Files (41 files)
The next step is to split each of these XML files into individual trials. Our overall strategy will be as follows. First, we create a directory for the trial files if one doesn't already exist. Then we get a list of all of the XML files for the decade and step through them one at a time. From each XML file, we extract every trial and save it as a separate file. Since a given trial is delimited by tags that look like <trial id="t-18341124-1" n="1"> ... </trial>, we can parse it out and save it separately as 't-18341124-1.txt'. You can read this trial online at the Old Bailey archives. You can also have a look at the XML file to see what we're dealing with. The fact that the OB team provides XML makes this archive an awesome resource for digital historians, and other online sites should follow suit.
There are a variety of ways to parse XML, but it is quick and easy to use the Beautiful Soup library for Python. The program that splits the XML files into separate trial files is here; for more information about using Beautiful Soup see The Programming Historian. There are far more trial files than session files: there were 12,959 trials in the 1830s alone. Now that we have one file for each trial, we're ready for the next step.
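To make the steps above concrete, here is a minimal sketch of the splitting step using Beautiful Soup. The folder names (`Tagged_1830s_Files`, `Trials_1830s`) and the choice of Python's built-in `html.parser` backend are illustrative assumptions, not the actual program linked above.

```python
import glob
import os

from bs4 import BeautifulSoup


def split_trials(source_dir, trial_dir):
    """Split each session XML file in source_dir into one file per trial."""
    # Create a directory for the trial files if one doesn't already exist
    os.makedirs(trial_dir, exist_ok=True)
    # Get a list of the decade's XML files and step through them one at a time
    for xml_path in sorted(glob.glob(os.path.join(source_dir, '*.xml'))):
        with open(xml_path, encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
        # Each trial is delimited by <trial id="t-..." n="...">...</trial>;
        # save it as a separate file named after its id, e.g. 't-18341124-1.txt'
        for trial in soup.find_all('trial'):
            out_path = os.path.join(trial_dir, trial['id'] + '.txt')
            with open(out_path, 'w', encoding='utf-8') as out:
                out.write(str(trial))


# Hypothetical usage for one decade's worth of session files
split_trials('Tagged_1830s_Files', 'Trials_1830s')
```

The same function can be called once per decade folder; using the trial's `id` attribute as the filename means the output names line up with the Old Bailey's own trial identifiers.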
Tags: archive | data mining | digital history | feature space | machine learning | text mining