- this trial is a burglary, which is a kind of theft
- this trial is also a burglary
- this trial is a wounding, which is a way of breaking the peace
- ...
Most machine learning researchers use a holdout method to test the performance of their learning algorithms. They use part of the data to train the system, then test its performance on the remaining part, which the system never saw during training. Items are randomly assigned to either the training or the testing pile, with the further stipulation that both piles should have roughly the same distribution of offence categories (a stratified split). Since burglaries made up about 2.153% (279/12959) of the trials in the 1830s, we want burglaries to make up about two percent of the training data and about two percent of the test data. It would do us no good for all of the burglaries to end up in one pile or the other.
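A minimal sketch of that kind of stratified holdout split, in Python. The `trials` list and the `label_of` accessor are hypothetical stand-ins for however the trial records and their offence categories are actually stored; the point is only to show the mechanism of splitting each category separately so both piles keep the same proportions.

```python
import random
from collections import defaultdict

def stratified_holdout(items, label_of, test_fraction=0.2, seed=42):
    """Split items into training and testing piles so that each
    offence category keeps roughly the same share in both piles."""
    random.seed(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)
    train, test = [], []
    for group in by_label.values():
        random.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])       # e.g. ~20% of the burglaries
        train.extend(group[cut:])      # the remaining ~80%
    return train, test

# Hypothetical usage, with trials as a list of (text, offence) pairs:
# train, test = stratified_holdout(trials, label_of=lambda t: t[1])
```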
But how do we know whether the results that we're seeing are some kind of fluke? We use cross-validation. We randomly divide our data into a number of piles (usually 10), making sure that the category that we are interested in is uniformly distributed across those piles. Now, we set aside the first pile and use the other nine piles to train our learner. We then test it on the first pile and record its performance. We then set aside the second pile for testing, and use the other nine piles for training a new learner. And so on, until each item has been used both for testing and for training. We can then average the ten error estimates. There are many other methods in the literature, of course, but this one is fairly standard.
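Here is a sketch of that tenfold procedure, again assuming a hypothetical `trials` list and `label_of` accessor; `train_fn` and `error_fn` are placeholders for whatever learner and error measure are being used.

```python
import random
from collections import defaultdict

def stratified_folds(items, label_of, k=10, seed=42):
    """Deal items into k piles so that each offence category is
    spread roughly evenly across the piles."""
    random.seed(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)
    folds = [[] for _ in range(k)]
    for group in by_label.values():
        random.shuffle(group)
        for i, item in enumerate(group):
            folds[i % k].append(item)
    return folds

def cross_validate(items, label_of, train_fn, error_fn, k=10):
    """Hold out each pile in turn, train on the other k-1 piles,
    test on the held-out pile, and average the k error estimates."""
    folds = stratified_folds(items, label_of, k)
    errors = []
    for i, held_out in enumerate(folds):
        training = [item for j, fold in enumerate(folds) if j != i
                    for item in fold]
        model = train_fn(training)              # hypothetical learner
        errors.append(error_fn(model, held_out))
    return sum(errors) / len(errors)
```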
Code to create a tenfold cross-validation sample from our data is here. As a check, we'd also like to make sure that our offence category is reasonably distributed across our sample (code for that is here).
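Since the linked code isn't reproduced here, a rough Python sketch of that distribution check might look like the following, reusing the hypothetical `stratified_folds` and `label_of` from above. For each pile it reports the share of trials belonging to the offence of interest, which should hover near the overall figure (about 2.153% for burglary in the 1830s).

```python
from collections import Counter

def check_distribution(folds, label_of, category="burglary"):
    """Sanity check: print the share of each fold that belongs to
    the offence category we are interested in."""
    for i, fold in enumerate(folds):
        counts = Counter(label_of(item) for item in fold)
        share = counts[category] / len(fold) if fold else 0.0
        print(f"fold {i}: {counts[category]}/{len(fold)} = {share:.3%} {category}")
```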
Tags: archive | data mining | digital history | feature space | machine learning | text mining