Digital History Hacks (2005-08): A Naive Bayesian in the Old Bailey, Part 5

Thursday, June 12, 2008

A Naive Bayesian in the Old Bailey, Part 5

With most of our support routines in place, we need to think about the problem of training a machine learner and then assessing its performance. A human being has already gone through each of the trials and assigned one or more offence categories to it:

this trial is a burglary, which is a kind of theft
this trial is also a burglary
this trial is a wounding, which is a way of breaking the peace
...

So we can give each raw trial to our learner, ask it to decide which offence category the trial belongs to, and then check our learner's answer against the human-assigned category. If we do this for enough trials, we can get a reliable sense of how good our learner is.
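To make the checking step concrete, here is a minimal sketch in Python. It assumes, for simplicity, that each trial carries exactly one category, and the function name `accuracy` and the toy labels are ours for illustration only.

```python
def accuracy(predicted, actual):
    """Fraction of trials where the learner's category matches the human-assigned one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if not actual:
        raise ValueError("need at least one trial")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hypothetical labels for four trials (not real Old Bailey data).
human = ["burglary", "burglary", "wounding", "theft"]
learner = ["burglary", "theft", "wounding", "theft"]
print(accuracy(learner, human))  # 3 of 4 answers agree
```

Trials with more than one assigned category would need a different comparison, for example scoring one category at a time.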

Most machine learning researchers use a holdout method to test the performance of their learning algorithms. They use part of the data to train the system, then test its performance on the remaining part, the part that wasn't used for training. Items are randomly assigned to either the training or the testing pile, with the further stipulation that both piles should have the same distribution of examples. Since burglaries made up about 2.153% (279/12959) of the trials in the 1830s, we want burglaries to make up about two percent of the training data and about two percent of the test data. It would do us no good for all of the burglaries to end up in one pile or the other.
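One way to get two piles with the same distribution is to shuffle and split each category separately. The sketch below does that using only the standard library. The function name `stratified_holdout`, the 25% test fraction, and the toy labels (built only to match the 279/12959 proportion) are all assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.25, seed=0):
    """Return (train_ids, test_ids), splitting each label in the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for trial_id, label in labels.items():
        by_label[label].append(trial_id)
    train, test = [], []
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)  # random assignment within this category
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

# Toy data with the 1830s proportions: 279 burglaries out of 12959 trials.
labels = {i: ("burglary" if i < 279 else "other") for i in range(12959)}
train, test = stratified_holdout(labels)

for name, pile in (("train", train), ("test", test)):
    share = sum(1 for t in pile if labels[t] == "burglary") / len(pile)
    print(f"{name}: {len(pile)} trials, {share:.2%} burglaries")
```

Both piles should come out at roughly two percent burglaries, which is the property we want.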

But how do we know whether the results that we're seeing are some kind of fluke? We use cross-validation. We randomly divide our data into a number of piles (usually 10), making sure that the category that we are interested in is evenly distributed across those piles. Now, we set aside the first pile and use the other nine piles to train our learner. We then test it on the first pile and record its performance. We then set aside the second pile for testing, and use the other nine piles for training a new learner. And so on, until each pile has been used once for testing and nine times for training. We can then average the ten error estimates. There are many other methods in the literature, of course, but this one is fairly standard.
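The procedure just described can be sketched as follows. This is not the code linked from this post, only an illustration under stated assumptions: `stratified_folds`, `train_majority` and `cross_validate` are hypothetical names, the toy labels only reproduce the 279/12959 proportion, and `train_majority` is a trivial stand-in that always guesses the most common category, so that the example runs without the naive Bayesian learner.

```python
import random
from collections import Counter

def stratified_folds(labels, k=10, seed=0):
    """Deal trial ids into k piles, spreading each label evenly across the piles."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(set(labels.values())):
        ids = sorted(t for t, l in labels.items() if l == label)
        rng.shuffle(ids)
        for trial_id in ids:  # round-robin keeps per-label counts within one of each other
            folds[position % k].append(trial_id)
            position += 1
    return folds

def train_majority(train_ids, labels):
    """Stand-in learner: always predicts the most common training label."""
    most_common = Counter(labels[t] for t in train_ids).most_common(1)[0][0]
    return lambda trial_id: most_common

def cross_validate(labels, k=10, seed=0):
    """Return (mean error, per-fold errors) over k train/test rounds."""
    folds = stratified_folds(labels, k, seed)
    errors = []
    for i, test_ids in enumerate(folds):
        # Train on every pile except the one set aside for testing.
        train_ids = [t for j, fold in enumerate(folds) if j != i for t in fold]
        predict = train_majority(train_ids, labels)
        wrong = sum(1 for t in test_ids if predict(t) != labels[t])
        errors.append(wrong / len(test_ids))
    return sum(errors) / k, errors

labels = {i: ("burglary" if i < 279 else "other") for i in range(12959)}
mean_error, fold_errors = cross_validate(labels)
print(f"mean error over {len(fold_errors)} folds: {mean_error:.4f}")
```

Because the stand-in never predicts "burglary", its error in each fold is simply that fold's share of burglaries. A real learner would be swapped in where `train_majority` is called.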

Code to create a tenfold cross-validation sample from our data is here. As a check, we'd also like to make sure that our offence category is reasonably distributed across our sample (code for that is here).
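For readers without access to the linked code, a distribution check along these lines might look like the sketch below. The function name `category_share_by_fold` and the two hand-built toy folds are hypothetical and are not taken from the post's own scripts.

```python
def category_share_by_fold(folds, labels, category):
    """For each fold, return (count, share) of trials carrying the given category."""
    report = []
    for fold in folds:
        count = sum(1 for trial_id in fold if labels[trial_id] == category)
        report.append((count, count / len(fold)))
    return report

# Toy example: 20 trials, every fifth one a burglary, split into two folds.
labels = {i: ("burglary" if i % 5 == 0 else "other") for i in range(20)}
folds = [list(range(0, 10)), list(range(10, 20))]

for n, (count, share) in enumerate(category_share_by_fold(folds, labels, "burglary"), 1):
    print(f"fold {n}: {count} burglaries ({share:.1%})")
```

If one fold's share is far from the others, the sample should be redrawn before training.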

Tags: archive | data mining | digital history | feature space | machine learning | text mining

William J. Turkel
