- First of all, we'll want lists of all of the files that need to be processed in a given decade. We could use the operating system for this, but Windows is pretty slow when you have tens of thousands of files in a directory. A program that writes the list of filenames out to a separate text file is here.
- We're also going to want a list of all of the dates on which trials occurred (in other words, we will want a list of all of the days that the court was in session). The program to generate that list and sort it in ascending order is here.
- Since our initial experiments will be focused on trying to automatically categorize trials by offence (e.g., "burglary"), we are going to need a few routines that make it easier to work with offences. One of these needs to return a mapping from trial IDs to one or more categories of offence (the code is here):
- t-18341124-1.txt -> theft-burglary
- t-18341124-2.txt -> theft-burglary
- t-18341124-3.txt -> breakingpeace-wounding
- ...
- t-18341124-37.txt -> theft-stealingfrommaster, theft-simplelarceny
- ...
- Another routine needs to return a mapping from a particular offence to a list of matching trial IDs (the code is here):
- theft-burglary -> t-18341124-1.txt, t-18341124-183.txt, t-18341124-185.txt, t-18341124-2.txt, t-18341124-4.txt, ...
- Finally, we are going to need to have some idea of how many offences there were of each kind in a particular decade (the code is here). For the 1830s, the data look like the following:
- breakingpeace-assault.txt|51
- breakingpeace-libel.txt|7
- breakingpeace-riot.txt|5
- breakingpeace-threateningbehaviour.txt|4
- breakingpeace-wounding.txt|166
- breakingpeace.txt|1
- damage-arson.txt|7
- damage-other.txt|1
- ...
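The routines described above might be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the linked programs: the function names and the regular expression are illustrative, and the eight digits in a trial ID such as t-18341124-37.txt are assumed to be a YYYYMMDD session date. How offences are read out of each trial file is not shown; the sketch starts from (trial ID, offence) pairs that are assumed to have been extracted already.

```python
import os
import re
from collections import Counter, defaultdict

# Assumed filename pattern, inferred from IDs such as "t-18341124-37.txt":
# "t-", an eight-digit YYYYMMDD date, "-", a trial number, ".txt".
TRIAL_RE = re.compile(r"^t-(\d{8})-(\d+)\.txt$")


def write_file_list(directory, out_path):
    """Write the trial filenames found in `directory` to `out_path`, one per line."""
    # os.scandir avoids building full stat info for every file up front.
    with os.scandir(directory) as entries:
        names = sorted(e.name for e in entries
                       if e.is_file() and TRIAL_RE.match(e.name))
    with open(out_path, "w", encoding="utf-8") as out:
        for name in names:
            out.write(name + "\n")
    return names


def session_dates(filenames):
    """Return the distinct YYYYMMDD dates in the filenames, in ascending order."""
    dates = set()
    for name in filenames:
        match = TRIAL_RE.match(name)
        if match:
            dates.add(match.group(1))
    # YYYYMMDD strings sort chronologically when sorted as text.
    return sorted(dates)


def trials_to_offences(pairs):
    """Map each trial ID to the list of offences recorded for it."""
    mapping = defaultdict(list)
    for trial_id, offence in pairs:
        if offence not in mapping[trial_id]:
            mapping[trial_id].append(offence)
    return dict(mapping)


def offences_to_trials(trial_map):
    """Invert trial -> offences into offence -> sorted list of trial IDs."""
    inverted = defaultdict(list)
    for trial_id, offences in trial_map.items():
        for offence in offences:
            inverted[offence].append(trial_id)
    return {offence: sorted(ids) for offence, ids in inverted.items()}


def offence_counts(trial_map):
    """Count how many trials involve each offence."""
    return Counter(o for offences in trial_map.values() for o in offences)


# Sample pairs taken from the listing above; a real run would supply
# pairs extracted from every trial file in the decade.
SAMPLE_PAIRS = [
    ("t-18341124-1.txt", "theft-burglary"),
    ("t-18341124-2.txt", "theft-burglary"),
    ("t-18341124-3.txt", "breakingpeace-wounding"),
    ("t-18341124-37.txt", "theft-stealingfrommaster"),
    ("t-18341124-37.txt", "theft-simplelarceny"),
]

trial_map = trials_to_offences(SAMPLE_PAIRS)
for offence, count in sorted(offence_counts(trial_map).items()):
    # Mirrors the "name.txt|count" layout of the 1830s listing.
    print(f"{offence}.txt|{count}")
```

Sorting the trial IDs as plain strings gives the same ordering as the theft-burglary example above, where t-18341124-183.txt comes before t-18341124-2.txt. If numeric order matters, the trial number captured by the regular expression could be used as a sort key instead.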