[Cross-posted to Cliopatria & Digital History Hacks]
Given that relatively few of our colleagues are familiar with digital history yet--and that those of us who practice some form of it aren't sure what to call it: digital history? history and computing? digital humanities?--it may seem a bit perverse to start talking about computational history. Nevertheless, it's an idea that we need, and the sooner we start talking and thinking about it, the better.
From my perspective, digital history simply refers to the idea that many of our potential sources are now available online. It is possible, of course, to expand this definition and tease out many of its implications. (For more on that, see the forthcoming interchange on "The Promise of Digital History" in the September 2008 issue of The Journal of American History.) To some extent we're all digital historians already, as it is quickly becoming impossible to imagine doing historical research without making use of e-mail, discussion lists, word processors, search engines, bibliographical databases and electronic publishing. Some day pretty soon, the "digital" in "digital history" is going to sound redundant, and we can drop it and get back to doing what we all love.
Or maybe not. By that time, I think, it will have become apparent that having networked access to an effectively infinite archive of digital sources, and to one another, has completely changed the nature of the game. Here are a few examples of what's in store.
Collective intelligence. Social software allows large numbers of people to interact efficiently and focus on solving problems that may be too difficult for any individual or small group. Does this sound utopian? Present-day examples are easy to find in massive online games, open source software, and even the much-maligned Wikipedia. These efforts all involve unthinkably complex assemblages of people, machines, computational processes and archives of representations. We have no idea what these collective intelligences will be capable of. Is it possible for an ad hoc, international, multi-lingual group of people to engage in a parallel and distributed process of historical research? Is it possible for a group to transcend the historical consciousness of the individuals that make it up? How does the historical reasoning of a collective intelligence differ from the historical reasoning of more familiar kinds of historian?
Machines as colleagues. Most of us are aware that law enforcement and security agencies routinely use biometric software to search through databases of images and video and identify people by facial characteristics, gait, and so on. Nothing precludes the use of similar software with historical archives. But here's the key point. Suppose you have a photograph of known provenance, depicting someone in whom you have an interest. Your biometric software skims through a database of historical images and matches your person to someone in a photo of a crowd at an important event. If the program is 95% sure that the match is valid, are you justified in arguing that your person was in the crowd that day?
Archives with APIs. Take it a step further. Most online archives today are designed to allow human users to find sources and read and cite them in traditional ways. It is straightforward, however, for the creators of these archives to add an application programming interface (API), a way for computer programs to request and make use of archival sources. You could train a machine learner to recognize pictures of people, artifacts or places and turn it loose on every historical photo archive with an API. Trained learners can be shared amongst groups of colleagues, or subjected, as populations, to a process of artificial selection. At present, APIs are most familiar in the form of mashups, websites that integrate data from different sources on-the-fly. The race is on now to provide APIs for some of the world's most important online archival collections.
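To make the idea a little more concrete, here is a rough sketch of what requesting sources from an archive's API might look like from the researcher's side. The endpoint, parameters and record fields below are all made up for the sake of illustration; a real collection would publish its own.

```python
# A hypothetical archive API: the URL, query parameters and JSON fields are
# invented for illustration, but the pattern is what matters -- a program,
# not just a person, can request archival sources and pass them along to
# other programs (learners, mashups, mapping tools).

import json
import urllib.request

BASE_URL = "http://archive.example.org/api/records"   # made-up endpoint

def fetch_records(keyword, start_year, end_year):
    """Ask the (imaginary) archive for records matching a keyword and date range."""
    query = "?q={}&from={}&to={}&format=json".format(keyword, start_year, end_year)
    with urllib.request.urlopen(BASE_URL + query) as response:
        return json.loads(response.read().decode("utf-8"))

# Whatever comes back can be handed directly to a machine learner or a mashup.
for record in fetch_records("larceny", 1830, 1839)[:5]:
    print(record["id"], record["date"], record["title"])
```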
Models. Agent-based and other approaches from complex adaptive systems research are beginning to infiltrate the edges of the discipline, particularly amongst researchers more inclined toward the social sciences. Serious games appeal to a generation of researchers that grew up with not-so-serious ones. People who might once have found quantitative history appealing are now building geographic information systems. In every case, computational processes become tools to think with. I was recently at the Metropolis on Trial conference, loosely organized around the 120-million-word online archive of the Old Bailey proceedings. At the conference, historians talked and argued about sources and interpretations, of course, but also about optical character recognition, statistical tables and graphs, and search results generated with tools on the website. We're not yet at a point where these discussions involve much nuanced analysis of layers of computational mediation... but it is definitely beginning.
Tags: computational history | digital history
Sunday, July 20, 2008
Thursday, July 03, 2008
A Naive Bayesian in the Old Bailey, Part 14
I'm off to England next week to present some of this work at the Metropolis on Trial conference, so it is time to bring this series of posts to a close. I'd like to wrap up by summarizing what we've accomplished and making a clearer case for machine learning as a tool for historical research.
Papers in the machine learning literature often say something like "we tested learners x, y, and z on this standard data set and found errors of 40%, 20% and 4% respectively. Learner z should therefore be used in this situation." The value of such research isn't immediately apparent to the working historian. For one thing, many of the most powerful machine learning algorithms require the learner to be given all of the training data at once. Historians, on the other hand, tend to encounter sources piecemeal, sometimes only recognizing their significance in retrospect. Training a machine learner usually requires a labelled data set: each item already has to be categorized. It's not obvious what good a machine learner is, if the researcher has to do all the work in advance. Finally, there is the troublesome matter of errors. What good is a system that screws up one judgement in ten? Or one in four?
In this work we considered a situation that is already becoming familiar to historians. You have access to a large archive of sources in digital form. These may consist of raw OCR text (full of errors), or they may be edited text, or, best of all, they may be marked up with XML, as in the case of the Old Bailey trials. Since most of us are not lucky enough to work with XML-tagged sources very often, I stripped out the tags to make my case more strongly.
Now suppose you know exactly what you're looking for, but no one has gone through the sources yet to create an index that you can use. In a traditional archive, you might be limited to starting at the beginning and plowing through the documents one at a time, skimming for whatever you're interested in. If your archive has been digitized you have another option. You can use a traditional search engine to index the keywords in the documents. (You could, for example, download them all to your own computer and index them with Google Desktop. Or you could get fancy with something like Lucene.) Unless your topic has very characteristic keywords, however, you will be getting a mix of relevant and irrelevant results with every search. Under many conditions, a keyword search is going to return hundreds or thousands of hits, and you are back to the point of going through them one at a time.
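For what it's worth, the core of a keyword index is simple enough to sketch in a few lines of Python. This toy version leaves out everything that makes a tool like Lucene worth using (stemming, ranking, phrase queries, scale), and the trial snippets and ids are invented, but it shows why keyword search alone can't separate relevant from irrelevant hits: it only knows which documents contain which words.

```python
# A toy inverted index: map each word to the set of documents that contain it.

import re
from collections import defaultdict

def build_index(documents):
    """documents: a dict mapping document id -> raw text."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, *keywords):
    """Return the ids of documents containing all of the given keywords."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()

docs = {
    "trial-001": "indicted for burglariously breaking and entering the dwelling house",
    "trial-002": "indicted for feloniously stealing one silver watch",
}
index = build_index(docs)
print(search(index, "stealing", "watch"))    # {'trial-002'}
```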
Suppose you're interested in larceny. (To make my point, I'm picking a category that the OB team has already marked up, but the argument is valid for anything that you or anyone else can reliably pick out. You might be studying indirect speech, or social deference, or the history of weights and measures. As long as you can look at each document and say "yes, I'm interested in this" or "no, I'm not interested in this" you can use this technique.) Anyway, you start with the first trial of 24 Nov 1834. It is a burglary, so you throw it in the "no" pile. The next record is a burglary, the third is a wounding, and so on. After you skim through 1,000 trials, you've found 444 examples of larceny and 556 examples of trials that weren't larceny. If you kept track of how long it took you to go through those thousand trials, you can estimate how long it will take for you to get through the remaining 11,959 trials in the 1830s, and approximately how many more cases of larceny you are likely to find. But you're less than a tenth of the way through the decade's trials, and no further ahead on the remaining ones.
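The arithmetic behind that estimate is nothing more than a proportion, but it is worth seeing how quickly the hours add up. The counts below come from the example above; the reading speed of two minutes per trial is an assumption, purely for illustration.

```python
# Back-of-the-envelope projection from the first thousand trials.

trials_read = 1000
larceny_found = 444
trials_remaining = 11959
minutes_per_trial = 2            # assumed reading speed, purely illustrative

expected_larceny = trials_remaining * larceny_found / trials_read
hours_remaining = trials_remaining * minutes_per_trial / 60

print(round(expected_larceny))   # roughly 5310 more larceny cases to be found
print(round(hours_remaining))    # roughly 399 more hours of skimming
```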
Machine learning gives you a very powerful alternative, as we saw in this series. The naive bayesian learner isn't the most accurate or precise one available, but it has a couple of enormous advantages for our application. First of all, it is relatively easy to understand and to implement. Although we didn't make use of this characteristic, it is also possible to stop the learner at any point and find out which features it thinks are most significant. Second, the naive bayesian is capable of incremental learning. We can train it with a few labelled items, then test it on some unlabelled items, then train it some more. Let's go back to the larceny example. Suppose as you look at each of the thousand trials, you hand it off to your machine learner along with the label that you've assigned. So once you decide the first trial is a burglary, you give it to the learner along with the label "no". (This doesn't have to be laborious... the process could easily be built into your browser, so as you review a document, you can click a plus or minus button to label it for your learner.) Where are you after 1,000 trials? Well, you've still found your 444 examples of larceny and your 556 examples of other offence categories. But at this point, you've also trained a learner that can look through the next 11,959 trials in a matter of seconds and give you a pile containing about 2,500 examples of larceny and about 750 false positives. That means that the next pile of stuff that you look through has been "enriched" for your research. Only 44% of the first thousand trials you looked at were examples of larceny. Almost 77% of the next three thousand trials you look at will be examples of larceny, and the remaining 23% will be more closely related offences. Since the naive bayesian is capable of online learning, you can continue to train it as you look through this next pile of data.
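To show that there is no magic here, the sketch below is a bare-bones version of the workflow just described: a naive bayesian that can be trained one labelled trial at a time and then asked to sort whatever you haven't read yet. It is not the learner used elsewhere in this series, and the trial snippets are invented, but the incremental pattern is the point.

```python
# A bare-bones incremental naive bayesian: train a little, classify a little.

import math
import re
from collections import defaultdict

def words(text):
    return re.findall(r"[a-z]+", text.lower())

class IncrementalNaiveBayes:
    def __init__(self):
        self.doc_counts = defaultdict(int)     # label -> number of documents seen
        self.word_counts = defaultdict(lambda: defaultdict(int))  # label -> word -> count
        self.totals = defaultdict(int)         # label -> total words seen
        self.vocabulary = set()

    def train(self, text, label):
        """Fold one labelled document into the model, e.g. label 'yes' or 'no'."""
        self.doc_counts[label] += 1
        for word in words(text):
            self.word_counts[label][word] += 1
            self.totals[label] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        """Return the label with the higher log posterior probability."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words(text):
                count = self.word_counts[label].get(word, 0)
                # Laplace smoothing keeps unseen words from zeroing out the score.
                score += math.log((count + 1) / (self.totals[label] + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Label as a side effect of reading: one click per trial.
learner = IncrementalNaiveBayes()
learner.train("indicted for burglariously breaking and entering the dwelling house", "no")
learner.train("indicted for feloniously stealing one silver watch", "yes")

# Then turn the learner loose on the unread trials to enrich the next pile.
print(learner.classify("stealing two printed books from the shop"))   # yes
```

With a little glue, train() becomes the plus or minus button in your browser, and classify() is what produces the enriched pile.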
Machine learning can be a powerful tool for historical research because
- It can learn as a side effect of your research process at very little cost to you
- You can stop the system at any point to see what it has learned, getting an independent measure of a concept of interest (see the sketch after this list)
- You can use it at any time to "look ahead" and find items that it thinks that you will be interested in
- Its false positive errors are often instructive, giving you a way of finding interesting things just beyond the boundaries of your categories
- A change in the learner's performance over time might signal a historically significant change or discontinuity in your sources
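To follow up on the second point: here is one rough way to ask the learner from the earlier sketch which words it currently finds most telling. The scoring used here (a smoothed log-likelihood ratio) is a simple choice of mine, not the method used elsewhere in this series, and it assumes the two-label learner built above.

```python
# Ask the two-label learner from the earlier sketch for its most telling words.

import math

def most_significant_features(learner, n=10):
    scores = {}
    for word in learner.vocabulary:
        p_yes = (learner.word_counts["yes"].get(word, 0) + 1) / \
                (learner.totals["yes"] + len(learner.vocabulary))
        p_no = (learner.word_counts["no"].get(word, 0) + 1) / \
               (learner.totals["no"] + len(learner.vocabulary))
        scores[word] = math.log(p_yes / p_no)
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Words with the highest scores are the ones pulling trials into the "yes" pile.
print(most_significant_features(learner))
```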
Tuesday, July 01, 2008
A Naive Bayesian in the Old Bailey, Part 13
So far, we've only been working with the Old Bailey trials of the 1830s, almost thirteen thousand in total. It would be nice to know if our learner continues to perform well as we give it more testing data. In the following runs, I trained a TFIDF-50 learner for each offence category that was attested more than 10 times in the 1830s. The training data consisted of all of the trials from the decade, labelled and presented to the learner in chronological order. Training was then stopped, and each learner was tested on the 25,403 unlabelled trials of the 1840s, also presented in chronological order. In order to assess the learners' performance, I used the same measures that we developed earlier, comparing the ratio of misses to hits (accuracy) and the ratio of false positives to hits (precision). As before, I added one to the denominator, so as not to accidentally divide by zero. (Computers hate it when you do that.)
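For readers who would rather see the measures than read them, here they are as a couple of one-line functions. The function names and the sample numbers are mine, for illustration only; they are not the actual counts for any offence category.

```python
# The two measures used in these posts: misses per hit and false positives per
# hit, with one added to the denominator so a category with zero hits doesn't
# cause a division by zero.

def miss_ratio(hits, misses):
    """The 'accuracy' measure in this series: lower is better."""
    return misses / (hits + 1)

def false_positive_ratio(hits, false_positives):
    """The 'precision' measure in this series: lower is better."""
    return false_positives / (hits + 1)

# Illustrative numbers only, not results for any actual offence category.
print(miss_ratio(hits=120, misses=180))                       # about 1.49
print(false_positive_ratio(hits=120, false_positives=240))    # about 1.98
```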
The results for the accuracy measure are shown below, in the form of a bar graph rather than the scatterplot-style figure we used before. In this graph and the next one, we can see that the performance of the learner is about as good for data that it hasn't seen (i.e., the 1840s trials) as it is for the data that were used to train it. Most of the measures are around two or less, which is comparable to what we saw before. The performance has actually improved for many of the offence categories, like assault, fraud, perjury, conspiracy, kidnapping, receiving and robbery. We do notice, however, some performance degradation for a number of sexual offences, including sexual assault with sodomitical intent, bigamy, indecent assault, rape and sodomy. This might be a statistical anomaly. On the other hand, it might be a sign that the language that was used to describe sexual offences changed somewhat in the 1840s, causing a learner trained on 1830s data to miss later cases. This is one of the ways that tools like machine learning can be used to generate new research questions.
The next figure shows the results for the precision measure. In general the learner makes more false positive errors than misses, which is exactly what we want, given that the false positives can be useful in themselves. We don't see quite the same clear difference between sexual and non-sexual offence categories that we saw with the accuracy measure ... and for some reason it is quite hard for our learner to pick out cases of perverted justice in the 1840s.
Tags: archive | data mining | digital history | feature space | machine learning | text mining