Monday, December 29, 2008

Coda

When I began this blog, I had the idea that it would be an integral part of my critical and reflective technical practice. For the past three years, it has served admirably, providing an easy way to share ideas and code and putting me in touch with a wide range of colleagues and new friends. During that time I've tried to stay true to the promise of "hacks," even if I pushed the boundaries of both "digital" and "history". As my technical work has evolved, however, I've begun to feel like this blog is less and less suited to my day-to-day activities. Rather than try and force it to fit, I've decided to build something new.

Tuesday, December 09, 2008

Some Winter Reading for Humanist Makers

(Crossposted to Cliopatria & Digital History Hacks)

In December 2004, I bought a copy of Joe Martin's Tabletop Machining to see what would be involved in learning how to make clockwork mechanisms and automata. It was pretty obvious that I had many years of study ahead of me, but I had just finished my PhD and knew that publishing that would take a few years more. So I didn't mind beginning something else that might take ten or fifteen years to master. Since then, I've been reading steadily about making things, but it wasn't until this past fall that I actually had the chance to set up a small Lab for Humanistic Fabrication and begin making stuff in earnest. Since it's December again, I thought I'd put together a small list of books to help other would-be humanist makers.


Friday, November 21, 2008

A Few Arguments for Humanistic Fabrication

By hooking a computer up to a machine that can add, remove, cut or fuse material, it is possible to turn a digital representation into a physical object. Most historians (at least ones reading this blog) are probably familiar with the idea of digitization; think of this as 'materialization', a reversal of the process. The humble printer is a kind of materializer for two-dimensional text and images. These other machines (often referred to as rapid prototyping or computer-aided manufacturing machines, or even 'replicators') allow their users to make manifest three-dimensional objects of plastic, wood, metal, or fancier composites.

Over the past few years, the price of rapid fabrication has been dropping, well, rapidly. A lab that once cost hundreds of thousands or millions of dollars can now be had for less than $20,000. Enthusiasts predict that the age of desktop fabrication is nigh; in the next few years we will all have devices on our desks that can print out 3D objects. (Neil Gershenfeld's Fab is a good introduction to some of the possibilities.) Small groups of DIY makers and hardware hackers are busy in their garages and attics trying to create a printer that can print a copy of itself, a machine that can print out a flashlight, one that can print a toroidal coil of candy, or burn a message into your morning toast. The popular appeal of all this activity is clear in the pages of MAKE magazine, or in the Discovery Channel's new show, "Prototype This".

There are a number of reasons why historians and other humanists should be getting involved in desktop fabrication right now. Here are a few.

We can't predict the future. In the 1960s, for example, it wasn't clear to everyone that there would ever be much reason for individuals to have the undivided attention of a single computer (never mind the dozens that we each now monopolize without thinking about it). In retrospect, the people who struggled to get individual access to computers, who bought them from mail-order catalogs and built them at home, who taught themselves how to program even when that meant reading thick manuals and punching cards... well, now we know how that turned out. Using a computer-controlled soldering iron to fuse grains of sugar into candy sculptures may seem a bit tangential to the serious business of academia, but it's really too soon to judge.

Mind and hand. Just because the separation between thinking and making is longstanding and well-entrenched doesn't make it a good idea. At various times in the past, humanists have been deeply involved in making stuff: Archimedes, the Banu Musa brothers, da Vinci, Vaucanson, the Lunar Men, Bauhaus, W. Grey Walter, Gordon Mumma. The list could easily be multiplied into every time and place, but the main point is that getting your hands dirty might be worthwhile, even if you're not da Vinci.

Historic experimentation. People who work with material culture, the history of technology or experimental archaeology know that you can learn a lot about the past by handling physical stuff. Until recently, that usually meant that you needed to have direct access to the stuff itself. Now it is possible to fabricate physical models or artifacts that share properties with possibly rare or priceless originals. Paleontologists and zooarchaeologists can learn from 3D printouts of bones and fossils. Historians of science can more readily replicate past experiments. And so on.

Tangible / haptic history. More generally, it will become possible to materialize shapes, surfaces, textures and artifacts that resemble those of the past, and that can be touched, felt, handled, and manipulated. It is easy to imagine a new tangible or haptic history that follows and extends the sensory histories that are being written right now.

Critical technical practice. In the late 1990s, Philip Agre argued for a mode of research that involved both "the craft work of design and ... the reflexive work of critique." The benefits of this approach are already apparent in the digital humanities, where historians, anthropologists, archaeologists, artists, literary and media scholars, and their colleagues are busy both creating and critiquing digital sources. Why not extend this practice to rapid fabrication, microelectronics, new materials, robotics or nanotechnology?

Some of the barriers are easily overcome. When someone asks me why a historian would need an 8-axis CNC milling machine or an oscilloscope, I say, "Why not?" The limitations of our physical spaces can be more difficult to circumvent. Most of the teaching and research environments available to humanists at my university are designed to support solitary or small-group office work. These spaces are almost comically unsuitable for the kinds of things I try to do with my students: soldering, moldmaking and casting, building and lighting physical exhibits, programming in groups, creating displays or signage. Although I could afford to purchase a laser cutter, I can't vent the poisonous fumes from my workspace. Cutting wood with power tools will set off the fire alarm. I certainly couldn't set up a little foundry to explore the bootstrapping process that led from metal casting to machine tools. There isn't even anywhere to lock up student project prototypes so they won't be stolen or vandalized. When I have a chance to talk to planners or people purchasing furniture or whatever, I ask them to imagine spaces that are appropriate for an art class or a shop class: high ceiling, natural light, plenty of ventilation, cement flooring, workbenches on casters, locking cabinets, big blank walls that you can hang things on. No carpeting, no beige cubicles, no coffee tables with plants. Humanists won't be able to think of themselves as makers until we create spaces for them to make things in.


Saturday, November 08, 2008

Hemlines and History Appliances

[Crossposted to Cliopatria & Digital History Hacks]

The Stock Market Skirt is a robot of sorts. Created a number of years ago by Toronto-based media artist Nancy Patterson, it consists of a party dress on a dressmaker's mannequin and a number of monitors displaying stock tickers. As prices fluctuate, "these values are sent to a program which determines whether to raise or lower the hemline via a stepper motor and a system of cables, weights and pulleys attached to the underside of the skirt. When the stock price rises, the hemline is raised; when the stock price falls, the hemline is lowered." I can only assume that the edge of the dress is rumpled up on the floor these days, and that the motors are somewhat the worse for wear.

The exhibit, of course, is a playful reinterpretation of George Taylor's hemline index. In the 1920s, Taylor, an economist at the Wharton School, observed that skirt lengths were correlated with the state of the economy. Since then, the observation has continued to be relatively robust, and these days has been extended into many other domains, like music and movie preferences, the water content in foods, and even the shapes of Playboy playmates.

I think the stock market skirt is a great example of what I call a "history appliance." The idea is supposed to be whimsical: what if a device could dispense historical consciousness the way a tap dispenses water? I've found that academic historians have a much harder time entertaining this question than public historians do. After all, the latter have a long tradition of trying to build events, exhibits and situations that communicate interpretations of the past in ways that supplement the written word. A diorama, for example, represents the past faithfully along some dimensions, but not all. You can do scientific tests on an artifact--if it isn't a fake, its material substance can be informative about past events. (Ditto if it is a fake.) You can't necessarily do scientific tests on a diorama, and yet it is possible for it to communicate information about the past veridically.

For a historian, the correlation between stock prices and hemlines raises questions of agency, and we feel comfortable exploring those on paper. Nothing foregrounds agency like a robot, however, and historians shouldn't shy away from building them into their historical interpretations.


Sunday, November 02, 2008

The Bridge Goes Both Ways

This week I found myself in a somewhat unfamiliar situation. Along with Randy Shifflett and Fabio Lopez-Lazaro, I was asked to represent the discipline of history at a community-building meeting of the LIKES (Living in the KnowlEdge Society) project at Virginia Tech. There, surrounded by computer scientists, engineers and other 'hard' scientists, we had to explain some of the challenges that face people who wish to integrate computation into historical research and teaching. In many ways, it was a return to fundamentals. We explained that many facts about the past are readily quantified, but that doing so often misses the point. Historical examples raised by our non-historian colleagues often focused on names and dates, and we had to tell them that the really interesting action is usually elsewhere. We reviewed ideas of contingency, counterfactual reasoning, and ambiguity. We explained why it usually doesn't make sense to project anachronistic categories and ideas onto past situations. We discussed the holism and methodological individualism of most researchers in our field.

When asked what kind of computational tools historians and other humanists need, the best metaphor that I could come up with drew on Jim Clifford's ideas of travel and translation. It would be easy to make tools that quantified how many miles you traveled on your vacation, how many feet you were standing from the sculpture when you took the picture, how you rated your meal in Venice on a scale of 1 to 10 ... but it would completely miss the point. Instead you want ways to help you translate, to capture and document your experiences, to cue your memories, to support your storytelling, to deepen your interpretations and understanding.

In this blog, I've assumed that most of my audience would be historians and other humanists who are interested in exploring digital and computational techniques at a number of levels. The LIKES meeting reminded me that the bridge goes both ways, that computer scientists, applied mathematicians, science educators and others are also interested in ways that their skills and tools might be applied in new domains. So, for those of you coming in the other direction: welcome! Here are a few things you might be interested to know:
  • Historians who are interested in quantification already know about and use spreadsheets, databases, mathematical models, computer programs and visualization. Historians who aren't interested in quantification won't be happy with a definition of 'qualitative' that consists of "leaving the numerical scale off of the axes of your graph."
  • The best way to promote computation amongst humanists is to emphasize social and textual applications of computing, especially ones that augment the power of individuals to do research that draws on collections of cultural / heritage materials that are distributed across many different repositories.
  • Verbs, not nouns. John Unsworth's paper on scholarly primitives is a good place to start.
  • There are a number of good books that take a humanistic perspective while still being sensitive to the potential of instrumental thinking. I particularly like Philip Agre's Computation and Human Experience and Lucy Suchman's Human-Machine Reconfigurations.
Tags: digital history

Monday, October 06, 2008

The One True Language

In Seven Nights, Borges has an essay where he describes the process by which he first read the Divine Comedy:

They were very handy books, published by Dent. They fit into my pocket. On the left was the Italian text, and on the right a literal translation. I devised this modus operandi: I first read a verse, a tercet, in the English prose; then I read the verse in Italian; and so on through to the end of the canto. Then I read the whole canto in English, and finally in Italian. With that first reading I realized that the translations were no substitute for the original text. The translation could be, at best, a means and a stimulus for the reader to approach the original. ... Poetry is, among so many other things, an intonation, an accentuation that is often untranslatable.

I was recently reminded of this because I decided that my digital history grad class should use the Processing programming language for their group project. Since I haven't programmed in the language before, I bought a couple of textbooks and sat down to read them, slowing when I needed to mentally translate unfamiliar commands into more familiar idioms.

Beginning programmers often worry about which language to learn first. Which one is the most powerful? The most useful? The easiest to learn? Which one will help me get a high-paying job? A semester or a year seems like a long time to invest in something that might turn out to be the wrong choice. At a theoretical level, programming languages are deeply equivalent, but that equivalence is more a matter of theory than practice... because every programming language makes some things easy and some things hard. Or, in the slogan of one language, it "makes easy things easy and hard things possible." These differences become the stuff of holy wars, but they shouldn't. The best language for the job depends largely on the job.

Processing, for instance, has built-in commands that make it easy to map numbers from one range of values to another. Now this isn't something that is too difficult to program from more primitive commands; it comes up frequently enough that you learn how to do it in whatever language you're using. But when I read the description of the Processing commands, I realized that I have implemented similar functions in almost every language that I've ever programmed in. By choosing to make this a language primitive, the designers of Processing made it easier for beginners to do a number of different tasks, including scaling the ranges of values returned by different analog sensors (which is something my students will need to do).
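
For readers who don't know Processing, a rough Python equivalent of that kind of range-mapping primitive only takes a few lines (the function name and the sensor example below are mine, not Processing's):

    def map_range(value, in_min, in_max, out_min, out_max):
        """Linearly map value from the range [in_min, in_max] to [out_min, out_max]."""
        return out_min + (value - in_min) * (out_max - out_min) / (in_max - in_min)

    # e.g., rescale a 10-bit analog sensor reading (0-1023) to a percentage
    print(map_range(512, 0, 1023, 0, 100))  # prints roughly 50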

There's no one true language for programming any more than there is one true language for humanism, or one true wood for carpenters. As Borges says, the intonations of poetry are often untranslatable, and it's true for code, too. In a sense, you don't really know how to program until you're familiar with more than one language, because the essence of programming consists in knowing how to translate the idioms of one language into a more or less familiar one. And this is something that humanists have long known: if there is a oneness and truth to language, it is to be found in the multiple practices of translation.


Friday, October 03, 2008

Navigating Digital History

This year, one of the first slides that I put up for the new Science, Technology and Global History class that Rob MacDougall and I are teaching was a quote from Patrick Manning's Navigating World History:

Navigating world history is an ambitious but limited goal, one quite distinct from the unattainable aim of "mastering" the topic. No one can learn all of world history. Anyone who pursues such a goal is sure to become lost. To strike an analogy, all those who have attempted to conquer the world have failed, but many of those who have traveled the globe have gained pleasure and expanded their understanding. (x)

I originally intended this forewarning as a way of managing expectations. I figured the students wouldn't be so disappointed in me when they found out that one consequence of taking on the history of everything from the Big Bang to human extinction is that the sum of the prof's knowledge asymptotically approaches zero. The students, however, seem to be taking my relative ignorance in stride, and the quote has mostly served to console me when I have to leave some stuff out of my lectures.

I was reminded of navigation the other day when I met with a PhD student who is close to finishing her doctorate and thinking about her second project. She wants to do something with digital sources but is having a hard time getting her bearings. Our conversation made me realize that I didn't have a single-page "getting started" guide for people who have never seriously worked with online sources. So here it is.

1. You won't be able to read everything. In fact, new material on your topic will appear online faster than you can read it. The longer you work on a topic, the more behind you will get. It's OK, because everyone faces this problem whether they realize it or not.

2. The first tool you should master is the search engine. Most people think that typing a word or two into the Google or Yahoo! search box is all that you need to know. Not so! First of all, search engines have an advanced search page that lets you focus in on your topic, exclude search terms, weight some terms more than others, limit your results to particular kinds of document, to particular sites, to date ranges, and so on. Second, different search engines introduce different kinds of bias by ranking results differently. You get a better view when you routinely use more than one.

3. You should have a strategy for information trapping. An explicit search is something that you do once, but the web is constantly changing. By using RSS feeds, it is possible to set up a number of searches that run automatically and provide you with a constantly updated view of your subject. You can learn more about the technique in Tara Calishain's Information Trapping; a short code sketch of the idea appears after this list.

4. You can organize citations right in your browser. Until you start doing advanced work in digital history, you will access almost all of your online sources through your web browser. If you use Zotero, you can keep track of those sources in your browser, too. It really speeds up the research process.

5. It is possible to automate the process of downloading sources. There are a number of tools that make it easy to grab large batches of online sources without having to download them one at a time. In the Firefox browser, for example, you can use something like DownThemAll. Another option is GNU Wget.

6. The web is not structured like a ball of spaghetti. A lot of the most interesting information to be gleaned from digital sources lies in the hyperlinks leading into and out of various nodes, whether personal pages, documents, archives, institutions, or what have you. Search engines provide some rudimentary tools for mapping these connections, but much more can be learned with more specialized tools.

7. Assume that what you want to know is out there, and go looking for it.
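
To make the information trapping idea in point 3 concrete, here is a minimal sketch using Python and the third-party feedparser library; the feed URL and keyword are placeholders for whatever saved searches you actually set up.

    import feedparser

    # hypothetical saved-search feeds; substitute the RSS URLs of your own searches
    feeds = ["http://example.com/search.rss?q=cholera"]
    keyword = "cholera"

    for url in feeds:
        for entry in feedparser.parse(url).entries:
            text = entry.get("title", "") + " " + entry.get("summary", "")
            if keyword.lower() in text.lower():
                print(entry.get("title"), "->", entry.get("link"))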


Sunday, September 21, 2008

Hello World!

It's traditional when learning a new programming language to have your first program simply say "Hello world!" and terminate. It not only boosts your confidence, it signals that you've got all of the basics in place: an editor to create programs, an interpreter or compiler that can follow the instructions that you've programmed, and a way for the information to get out of the program and the computer into a form where you can make use of it. What you do at that point is up to you... hopefully more programming.

Over the last few years, I've been wrapping up my first book project: a study of how people reconstruct the past from various kinds of physical traces. My interest in the ways that material evidence and places can inform historical consciousness, and a growing interest in the potential of digital and public history, have led me to a related set of research questions. How can we use new technologies like ubiquitous / pervasive computing, ambient and tangible interfaces, and desktop fabrication to build historical interpretations into physical devices and environments? What happens when all of the bits that we've been creating through various kinds of digitization can become material atoms again? And how can this help us to better understand various pasts and make them usable in the present?

For a couple of years I've been doing projects with my students and research assistants that use technology to augment everyday places and objects, to put historical interpretations back into stuff. These projects have made use of GPS-enabled handheld and tablet computers; microcontrollers, analog sensors and actuators; and other electronic technologies. Up until now, however, we've had to buy physical components or fashion them by hand.

Last week, Adam, Devon and I had a chance to set up our new Roland Modela MDX-20 and try making something with it, a kind of physical "hello world."

[Photo: our first, rather chunky, physical "hello world" milled on the MDX-20]

The MDX-20 is a (relatively simple) computer-controlled milling machine. It is able to move a rapidly spinning, sharp tool in three dimensions, gradually removing material from a solid block, so that it comes to precisely resemble a three-dimensional model in the computer. What this means is that something that is almost purely virtual can be materialized in foam, plastic, wood, and other soft physical media. (Although our first efforts look pretty chunky, the machine is capable of much more precise contours--we have a lot to learn.) The MDX-20 also has a scanning probe, which we haven't had a chance to test yet. When used in scanning mode, the MDX-20 automates the creation of 3D models from physical objects. This allows you to start with one or more objects in the real world, scan them to create 3D models, edit or remix as desired, then replicate them in material form. At this point, the possibilities seem nearly endless.

In Thing Knowledge, the philosopher Davis Baird argues that "Things and theory can both constitute our knowledge of the world." Things can serve as models, physical representations that act in a similar way to theories. They can create phenomena, separating action "from human agency and buil[ding it] into the reliable behavior of an artifact." Or they can serve as measuring instruments, combining both representation and work (11-12). There's a long tradition of ignoring things to focus on ideas, cyberspace being one of many guises for idealism. It's time for digital humanists to say, "Hello, world!"


Saturday, September 06, 2008

Practices, Not Products

The first week of school is a good time to expect to see Murphy's law in action. This year my server suddenly decided to start falling over at diminishing intervals. A couple of weeks of sporadic debugging have left me where I started: with an unreliable server. All of the code and images for this blog are hosted there, so they are temporarily unavailable, and I had to scramble a bit to find a new online home for the group project that my students will be doing in their grad class in digital history. This summer, too, the keepers of my old standby the online Dictionary of Canadian Biography suddenly decided to overhaul their website. While I applaud the fact that they moved from Active Server Pages to PHP, I'm not so happy that so many of the code examples in Digital History Hacks and the Programming Historian have to be revised.

I've solved my server problem for the time being, or more accurately, sidestepped it, by moving a lot of my online stuff to a new home: digitalhistory.wikispot.org. In the process I was reminded again that wikis really are the fastest and most awesome way to get your stuff online in a form that is durable but plastic enough to be continually reshaped. I can thank Raymond Yee for the inspiration. Although I've used a number of online tools, it didn't occur to me that a wiki can replace most of them until I saw Raymond give a talk at THATCamp. Rather than bust out an Open Office presentation or something like that, Raymond pointed his browser to his own wiki, a "working space / public knowledge repository". He had already entered some of the material that he wanted to talk about, and as he gave his presentation he continued to edit. When his presentation was over, he clicked 'save' and everything was already available online.

The beauty of a wiki, as many people have noted, is that it allows online material to grow quickly and organically. Rather than try to build my new online presence in one pass, I was able to sketch the outlines of what I wanted to add. Now, every time I look at the site, I see a whole bunch of work that still needs to be done. I can chip away at it, rethink, reorganize, and everything remains available to other people. On some of the pages I've roughed out sections for my students or research assistants to fill in; I expect them to chip away, rethink and reorganize, too. In effect, wiki software can provide scaffolding for practices. There's no real final product, just the most recent edit. (And, of course, access to the entire history of edits).

This year, Rob MacDougall and I are teaching a new course on science, technology and global history, and I find myself in the (exciting? unenviable?) position of writing my lectures the week before I give them. A lot of my projects feel like they may be on hold until November, when I can hand the lecturing off to Rob and start to deal with some of the changes that have broken things that used to work. I can't feel too bothered, however. All is flux, especially on the internet. The trick is to find the techniques and tools that help you deal gracefully with change, to think in clay and not in stone.


Wednesday, August 20, 2008

Traces of Use

When he figured that April was the cruelest month, I think TS Eliot was off by four. I find that the early summer stretches into an endless vista of exciting possibilities for new research and teaching. I make far too many commitments, all of which come back to haunt me in late August. Other than dropping in to do light maintenance, for example, I haven't had time recently to write much new material for the Programming Historian. The last time that I did, however, I noticed that visitor logs tell an interesting story.

To date, the front page has received around 12 thousand hits, as people arrive at the site and decide what to do next. At that point, most of them leave. They may have ended up there by accident; they may bookmark the site to look at later. The next two sections are prefatory. The first (around 4 thousand visits) suggests why you may want to learn how to program. The second (almost 5 thousand visits) tells you how to install the software that you need to get started. My interpretation is that about a fifth of our visitors are already convinced they want to learn how to program, which I think is a good sign. The actual programming starts in the next section (2 thousand visits) and goes from there (while the number of visitors for subsequent sections slowly drops to about a thousand each). These numbers could be interpreted in various ways, but to me they suggest that (1) historians and other humanists want to learn how to program, (2) good intentions only get you so far, and (3) if you do stick with it, it gets harder gradually.

These are pretty crude metrics, although more informative ones than I'm getting from, say, the sales figures for my award-winning-but-otherwise-neglected monograph (buy a copy today!). My friends who work in psycholinguistics have much more sophisticated ways of determining how people read and understand text, with devices that track the subject's gaze and estimate the moment-by-moment contents of their short-term memory. I want people to get something out of the Programming Historian, but I don't need that level of detail about what they're getting.

In The Social Life of Information, Brown and Duguid have an anecdote about a historian who goes through batches of eighteenth-century letters rapidly by sniffing bundles of them. When asked what he is doing, he explains that letters written during a cholera outbreak were disinfected with vinegar. "By sniffing for the faint traces of vinegar that survived 250 years and noting the date and source of the letters, he was able to chart the progress of cholera outbreaks." Brown and Duguid go on to note that "Digitization could have distilled out the text of those letters. It would, though, have left behind that other interesting distillate, vinegar."

Probably, but not necessarily. Digitization simply refers to the explicit digital representation of something that can be measured. We are content at the moment with devices that take pictures of documents, and those devices have been steadily improving. We wouldn't be as content with the scanning quality of 2002, when The Social Life of Information was published, and we'd, like, totally hate the scanning quality of 1982 or 1962 ... just ask my students when they have to work with microfilm. That said, high-resolution infrared spectroscopy makes it possible to build chemical sniffers that outperform human noses. Such sniffers also make it possible to go through an archive and digitize the smells of every document.

Saying that we can digitize any trace that we can discover and measure isn't the same thing as saying we can discover and measure any trace that we might need at the moment, episodes of CSI notwithstanding. The material world is almost infinitely informative about the past, but the traces that are preserved have nothing to do with our interests and intents. And one shouldn't draw too fine a line between the analog and the digital, because digital representations are always stored on real-world analog devices, something Matt Kirschenbaum explores in his new book Mechanisms.


Thursday, August 07, 2008

Arms Races

[Cross-posted to Cliopatria and Digital History Hacks]

Like many people who blog at Blogger, I was recently notified by e-mail that my blog had been identified by their automated classifiers "as a potential spam blog." In order to prove that this was not the case, I had to log in to one of their servers and request that my blog be reviewed by a human being. The e-mail went on to say "Automatic spam detection is inherently fuzzy, and occasionally a blog like yours is flagged incorrectly. We sincerely apologize for this error." The author of the e-mail knew, of course, that if my blog were sending spam then his or her e-mail would fall on deaf ears (as it were)... you don't have to worry about bots' feelings. The politeness was intended for me, a hapless human caught in the crossfire in the war of intelligent machines.

That same week, a lot of my e-mails were also getting bounced. Since I have my blog address in my .sig file, I'm guessing that may have something to do with it. Alternatively, my e-mail address may have been temporarily blocked as the result of a surge in spam being sent from GMail servers. This to-and-fro, attack against counter-attack, Spy vs. Spy kind of thing can be irritating for the collaterally damaged, but it is good news for digital historians, as paradoxical as that may seem.

One of the side effects of the war on spam has been a lot of sophisticated research on automated classifiers that use Bayesian or other techniques to categorize natural language documents. Historians can use these algorithms to make their own online archival research much more productive, as I argued in a series of posts this summer.

In fact, a closely related arms race is being fought at another level, one that also has important implications for the digital humanities. The optical character recognition (OCR) software that is used to digitize paper books and documents is also being used by spammers to try to circumvent software intended to block them. This, in turn, is having a positive effect on the development of OCR algorithms, and leading to higher-quality digital repositories as a collateral benefit. Here's how.
  • Computer scientists create the CAPTCHA, a "Completely Automated Public Turing test to tell Computers and Humans Apart." In essence, it shows a wonky image of a short text on the screen, and the (presumably human) user has to read it and type in the characters. If they match, the system assumes a real person is interacting with it.
  • Google releases the Tesseract OCR engine that they use for Google Books as open source. On the plus side, a whole community of programmers can now improve Tesseract OCR. On the minus side, a whole community of spammers can put it to work cracking CAPTCHAs.
  • In the meantime, a group of computer scientists comes up with a brilliant idea, the reCAPTCHA. Every day, tens of millions of people are reading wonky images of short character strings and retyping them. Why not use all of these infinitesimal units of labor to do something useful? The reCAPTCHA system uses OCR errors for its CAPTCHAs. When you respond to a reCAPTCHA challenge, you're helping to improve the quality of digitized books.
  • The guys with white hats are also using OCR to crack CAPTCHAs, with the aim of creating stronger challenges. One side effect is that the OCR gets better at recognizing wonky text, and thus better for creating digital books.


Sunday, July 20, 2008

Towards a Computational History

[Cross-posted to Cliopatria & Digital History Hacks]

Given that relatively few of our colleagues are familiar with digital history yet--and that those of us who practice some form of it aren't sure what to call it: digital history? history and computing? digital humanities?--it may seem a bit perverse to start talking about computational history. Nevertheless, it's an idea that we need, and the sooner we start talking and thinking about it, the better.

From my perspective, digital history simply refers to the idea that many of our potential sources are now online and available on the internet. It is possible, of course, to expand this definition and tease out many of its implications. (For more on that, see the forthcoming interchange on "The Promise of Digital History" in the September 2008 issue of The Journal of American History). To some extent we're all digital historians already, as it is quickly becoming impossible to imagine doing historical research without making use of e-mail, discussion lists, word processors, search engines, bibliographical databases and electronic publishing. Some day pretty soon, the "digital" in "digital history" is going to sound redundant, and we can drop it and get back to doing what we all love.

Or maybe not. By that time, I think, it will have become apparent that having networked access to an effectively infinite archive of digital sources, and to one another, has completely changed the nature of the game. Here are a few examples of what's in store.

Collective intelligence. Social software allows large numbers of people to interact efficiently and focus on solving problems that may be too difficult for any individual or small group. Does this sound utopian? Present-day examples are easy to find in massive online games, open source software, and even the much-maligned Wikipedia. These efforts all involve unthinkably complex assemblages of people, machines, computational processes and archives of representations. We have no idea what these collective intelligences will be capable of. Is it possible for an ad hoc, international, multi-lingual group of people to engage in a parallel and distributed process of historical research? Is it possible for a group to transcend the historical consciousness of the individuals that make it up? How does the historical reasoning of a collective intelligence differ from the historical reasoning of more familiar kinds of historian?

Machines as colleagues. Most of us are aware that law enforcement and security agencies routinely use biometric software to search through databases of images and video and identify people by facial characteristics, gait, and so on. Nothing precludes the use of similar software with historical archives. But here's the key point. Suppose you have a photograph of known provenance, depicting someone in whom you have an interest. Your biometric software skims through a database of historical images and matches your person to someone in a photo of a crowd at an important event. If the program is 95% sure that the match is valid, are you justified in arguing that your person was in the crowd that day?

Archives with APIs. Take it a step further. Most online archives today are designed to allow human users to find sources and read and cite them in traditional ways. It is straightforward, however, for the creators of these archives to add an application programming interface (API), a way for computer programs to request and make use of archival sources. You could train a machine learner to recognize pictures of people, artifacts or places and turn it loose on every historical photo archive with an API. Trained learners can be shared amongst groups of colleagues, or subject as populations to a process of artificial selection. At present, APIs are most familiar in the form of mashups, websites that integrate data from different sources on-the-fly. The race is on now to provide APIs for some of the world's most important online archival collections.

Models. Agent-based and other approaches from complex adaptive systems research are beginning to infiltrate the edges of the discipline, particularly amongst researchers more inclined toward the social sciences. Serious games appeal to a generation of researchers that grew up with not-so-serious ones. People who might once have found quantitative history appealing are now building geographic information systems. In every case, computational processes become tools to think with. I was recently at the Metropolis on Trial conference, loosely organized around the 120 million word online archive of the Old Bailey proceedings. At the conference, historians talked and argued about sources and interpretations, of course, but also about optical character recognition and statistical tables and graphs and search results generated with tools on the website. We're not yet at a point where these discussions involve much nuanced analysis of layers of computational mediation... but it is definitely beginning.


Thursday, July 03, 2008

A Naive Bayesian in the Old Bailey, Part 14

I'm off to England next week to present some of this work at the Metropolis on Trial conference, so it is time to bring this series of posts to a close. I'd like to wrap up by summarizing what we've accomplished and making a clearer case for machine learning as a tool for historical research.

Papers in the machine learning literature often say something like "we tested learners x, y, and z on this standard data set and found errors of 40%, 20% and 4% respectively. Learner z should therefore be used in this situation." The value of such research isn't immediately apparent to the working historian. For one thing, many of the most powerful machine learning algorithms require the learner to be given all of the training data at once. Historians, on the other hand, tend to encounter sources piecemeal, sometimes only recognizing their significance in retrospect. Training a machine learner usually requires a labelled data set: each item already has to be categorized. It's not obvious what good a machine learner is, if the researcher has to do all the work in advance. Finally, there is the troublesome matter of errors. What good is a system that screws up one judgement in ten? Or one in four?

In this work we considered a situation that is already becoming familiar to historians. You have access to a large archive of sources in digital form. These may consist of raw OCR text (full of errors), or they may be edited text, or, best of all, they may be marked up with XML, as in the case of the Old Bailey trials. Since most of us are not lucky enough to work with XML-tagged sources very often, I stripped out the tags to make my case more strongly.

Now suppose you know exactly what you're looking for, but no one has gone through the sources yet to create an index that you can use. In a traditional archive, you might be limited to starting at the beginning and plowing through the documents one at a time, skimming for whatever you're interested in. If your archive has been digitized you have another option. You can use a traditional search engine to index the keywords in the documents. (You could, for example, download them all to your own computer and index them with Google Desktop. Or you could get fancy with something like Lucene.) Unless your topic has very characteristic keywords, however, you will be getting a mix of relevant and irrelevant results with every search. Under many conditions, a keyword search is going to return hundreds or thousands of hits, and you are back to the point of going through them one at a time.
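
As a toy illustration of what a keyword index gives you (and what it doesn't), the sketch below builds a tiny inverted index in Python and intersects the entries for two words. The sample documents are invented; the point is that a query hands back every document containing the keywords, relevant or not, and the reading is still up to you.

    import re
    from collections import defaultdict

    # invented sample documents
    docs = {"t-001.txt": "indicted for burglariously breaking and entering the dwelling house",
            "t-002.txt": "indicted for stealing one handkerchief value two shillings",
            "t-003.txt": "indicted for stealing one silver watch"}

    index = defaultdict(set)            # word -> set of documents containing it
    for name, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(name)

    print(sorted(index["stealing"] & index["handkerchief"]))  # ['t-002.txt']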

Suppose you're interested in larceny. (To make my point, I'm picking a category that the OB team has already marked up, but the argument is valid for anything that you or anyone else can reliably pick out. You might be studying indirect speech, or social deference, or the history of weights and measures. As long as you can look at each document and say "yes, I'm interested in this" or "no, I'm not interested in this" you can use this technique.) Anyway, you start with the first trial of 24 Nov 1834. It is a burglary, so you throw it in the "no" pile. The next record is a burglary, the third is a wounding, and so on. After you skim through 1,000 trials, you've found 444 examples of larceny and 556 examples of trials that weren't larceny. If you kept track of how long it took you to go through those thousand trials, you can estimate how long it will take for you to get through the remaining 11,959 trials in the 1830s, and approximately how many more cases of larceny you are likely to find. But you're less than a tenth of the way through the decade's trials, and no further ahead on the remaining ones.

Machine learning gives you a very powerful alternative, as we saw in this series. The naive bayesian learner isn't the most accurate or precise one available, but it has a couple of enormous advantages for our application. First of all, it is relatively easy to understand and to implement. Although we didn't make use of this characteristic, it is also possible to stop the learner at any point and find out which features it thinks are most significant. Second, the naive bayesian is capable of incremental learning. We can train it with a few labelled items, then test it on some unlabelled items, then train it some more. Let's go back to the larceny example. Suppose as you look at each of the thousand trials, you hand it off to your machine learner along with the label that you've assigned. So once you decide the first trial is a burglary, you give it to the learner along with the label "no". (This doesn't have to be laborious... the process could easily be built into your browser, so as you review a document, you can click a plus or minus button to label it for your learner.) Where are you after 1,000 trials? Well, you've still found your 444 examples of larceny and your 556 examples of other offence categories. But at this point, you've also trained a learner that can look through the next 11,959 trials in a matter of seconds and give you a pile containing about 2,500 examples of larceny and about 750 false positives. That means that the next pile of stuff that you look through has been "enriched" for your research. Only 44% of the first thousand trials you looked at were examples of larceny. Almost 77% of the next three thousand trials you look at will be examples of larceny, and the remaining 23% will be more closely related offences. Since the naive bayesian is capable of online learning, you can continue to train it as you look through this next pile of data.
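
To make that workflow concrete, here is a rough sketch (not the code used in this series) of a naive bayesian learner that can be trained one labelled trial at a time and then asked to sort an unread pile. The feature handling, smoothing and sample texts are deliberately simplistic.

    import math, re
    from collections import defaultdict

    class NaiveBayes:
        """Incrementally trainable two-class naive bayesian text classifier."""
        def __init__(self):
            self.word_counts = {"yes": defaultdict(int), "no": defaultdict(int)}
            self.doc_counts = {"yes": 0, "no": 0}

        def features(self, text):
            return set(re.findall(r"[a-z]+", text.lower()))

        def train(self, text, label):
            self.doc_counts[label] += 1
            for word in self.features(text):
                self.word_counts[label][word] += 1

        def score(self, text, label):
            # log P(label) plus the sum of log P(word | label), with add-one smoothing
            total = sum(self.doc_counts.values())
            logp = math.log((self.doc_counts[label] + 1) / (total + 2))
            for word in self.features(text):
                logp += math.log((self.word_counts[label][word] + 1) /
                                 (self.doc_counts[label] + 2))
            return logp

        def classify(self, text):
            return "yes" if self.score(text, "yes") > self.score(text, "no") else "no"

    learner = NaiveBayes()
    # as you read and label each trial, hand it to the learner...
    learner.train("stealing a handkerchief from the person of john smith", "yes")
    learner.train("burglariously breaking and entering the dwelling house", "no")
    # ...then let it pull out the trials in the unread pile that look like larceny
    unread = ["feloniously stealing one silver watch", "breaking and entering at night"]
    print([t for t in unread if learner.classify(t) == "yes"])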

Machine learning can be a powerful tool for historical research because
  1. It can learn as a side effect of your research process at very little cost to you
  2. You can stop the system at any point to see what it has learned, getting an independent measure of a concept of interest
  3. You can use it at any time to "look ahead" and find items that it thinks that you will be interested in
  4. Its false positive errors are often instructive, giving you a way of finding interesting things just beyond the boundaries of your categories
  5. A change in the learner's performance over time might signal a historically significant change or discontinuity in your sources

Tuesday, July 01, 2008

A Naive Bayesian in the Old Bailey, Part 13

So far, we've only been working with the Old Bailey trials of the 1830s, almost thirteen thousand in total. It would be nice to know if our learner continues to perform well as we give it more testing data. In the following runs, I trained a TFIDF-50 learner for each offence category that was attested more than 10 times in the 1830s. The training data consisted of all of the trials from the decade, labelled and presented to the learner in chronological order. Training was then stopped, and each learner was tested on the 25,403 unlabelled trials of the 1840s, also presented in chronological order. In order to assess the learners' performance, I used the same measures that we developed earlier, comparing the ratio of misses to hits (accuracy) and the ratio of false positives to hits (precision). As before, I added one to the denominator, so as not to accidentally divide by zero. (Computers hate it when you do that.)

The results for the accuracy measure are shown below, in the form of a bar graph rather than the scatterplot-style figure we used before. In this graph and the next one, we can see that the performance of the learner is about as good for data that it hasn't seen (i.e., the 1840s trials) as it is for the data that were used to train it. Most of the measures are around two or less, which is comparable to what we saw before. The performance has actually improved for many of the offence categories, like assault, fraud, perjury, conspiracy, kidnapping, receiving and robbery. We do notice, however, some performance degradation for a number of sexual offences, including sexual assault with sodomitical intent, bigamy, indecent assault, rape and sodomy. This might be a statistical anomaly. On the other hand, it might be a sign that the language that was used to describe sexual offences changed somewhat in the 1840s, causing a learner trained on 1830s data to miss later cases. This is one of the ways that tools like machine learning can be used to generate new research questions.

[Bar graph: accuracy measure (misses/hits) by offence category, learner trained on the 1830s and tested on the 1840s]

The next figure shows the results for the precision measure. In general the learner makes more false positive errors than misses, which is exactly what we want, given that the false positives can be useful in themselves. We don't see quite the same clear difference between sexual and non-sexual offence categories that we saw with the accuracy measure ... and for some reason it is quite hard for our learner to pick out cases of perverted justice in the 1840s.

[Bar graph: precision measure (false positives/hits) by offence category, 1840s test data]


Friday, June 27, 2008

A Naive Bayesian in the Old Bailey, Part 12

Up until now, we've measured the error rates of our various learners without worrying too much about what good an error-prone machine learner actually is. By dividing the learner's responses into the four categories of hit, miss, false positive and correct negative, we can get a more nuanced picture of what it is doing when it makes a mistake. Here we look at false positives, trials that the learner mistakenly identifies as belonging to the category of interest. We start by writing a program that goes through each of the TFIDF-50 learner's responses for the various offence categories in the 1830s. It collects all of the false positives, making a note of what offence category each trial actually belongs to. The code to do this is here. We can then plot the information in a convenient form. I've decided to use pie charts.
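
In outline, the bookkeeping looks something like this (the labels below are invented, and this is a simplified sketch rather than the original script):

    from collections import Counter

    def false_positive_breakdown(predicted, actual, category):
        """Count the actual offence categories of trials wrongly assigned to category."""
        breakdown = Counter()
        for guess, truth in zip(predicted, actual):
            if guess == category and truth != category:
                breakdown[truth] += 1
        return breakdown

    # invented example labels
    predicted = ["assault", "assault", "larceny", "assault"]
    actual = ["indecentAssault", "assault", "larceny", "wounding"]
    print(false_positive_breakdown(predicted, actual, "assault"))
    # Counter({'indecentAssault': 1, 'wounding': 1})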

The figure below shows the results for the offence category of assault, coded as a way of breaking the peace. What happens when our learner thinks that a trial is an example of this category but it really isn't? About 38.6% of the time, the trial in question was actually categorized as indecent assault (sexual), and about 38.6% of the time it was assault with intent (also sexual). Almost 11% of the time, the trial was a case of assault with sodomitical intent, and another 8% of the trials were actually categorized as an instance of wounding. In other words, about 96% of the learner's false positive "errors" in this case were other kinds of assault. What of the trials classified as "miscellaneous - other"? One was this trial, where 44-year-old William Blackburn was found guilty of "unlawfully and maliciously administering to Hannah Mary Turner 6 drachms of tincture of cantharides, with intent to excite, &c." I understand that this case probably doesn't fit the definition of assault used by either Blackburn's contemporaries or the person who coded the file. Nevertheless, it is not completely unrelated to the idea of an assault, and is exactly the kind of source that a historian could use to shed light on gender relations, sexuality, or other topics.

[Pie chart: actual offence categories of false positives for assault (breaking the peace)]

The next figure shows the false positives for fraud, categorized as a kind of deception. Seventy-two percent of the learner's false positives in this case were actually categorized as coining offences, and another 12% were actually cases of forgery. Once again, the vast majority of cases that were incorrectly identified as fraud belonged to relatively closely related offence categories. Note that these results cannot be explained by appealing to the distribution of offences in the sample as a whole. If the false positives were selected by the learner at random, we would expect most of them to be cases of larceny, which are by far the most commonly attested. Instead we see that a learner trained to recognize one kind of assault is confused by other kinds of assault, and one trained on fraud by other kinds of fraud.

[Pie chart: actual offence categories of false positives for fraud (deception)]

A learner trained on manslaughter is mostly confused by cases of wounding and murder, as shown in the next figure.

[Pie chart: actual offence categories of false positives for manslaughter]

Finally we can consider a kind of theft, in this case housebreaking. If any learner were going to be confused by larceny cases, it should be one trained to recognize a type of theft. Instead, this learner is more confused by the less-frequently attested but more closely related categories of burglary and theft from place.

[Pie chart: actual offence categories of false positives for housebreaking]

Now we are in a position to provide one kind of answer to the question, "what good is an error-prone learner?" Since the learner's errors are meaningfully related to its successful ability to categorize, we can use false positives as a way of generalizing beyond the bounds of hard-and-fast categorization. If we used a search engine to find cases of assault, we might miss some of the most interesting such cases (like the cantharides example) ... cases that are interesting precisely because they lie just outside the category. One of the things that machine learning gives us is a way of finding some of the more interesting exceptions to our rules.


Wednesday, June 25, 2008

A Naive Bayesian in the Old Bailey, Part 11

We feel pretty confident that the performance of the TFIDF-50 version of the naive bayesian learner is going to be relatively stable regardless of the frequency with which a particular offence is attested. At this point we can write a routine which tests the learner on each of the offences which occurred 10 or more times in the 1830s. Our testing routine takes advantage of the fact that, unlike many other kinds of machine learner, the naive bayesian can be operated in online mode. What this means is that we can train the learner on some data, test its performance, then train it on some more data. Many learners can only be operated in offline or batch mode. This means they have to be trained on all of the data before they can be tested, and there is no way at that point to subject them to further training. The fact that the naive bayesian can be used for online learning will turn out to be crucial for us.

The code for testing is here. The learner is given the trials in chronological order, one at a time. The way that the program works is that it first uses the current state of the learner to classify a trial. The classification is scored as a hit, miss, false positive or correct negative, then the trial is used to train the learner (with the appropriate category being given as feedback). The learner is then given the next trial to judge. Once the learner has seen all of the data, the final count of hits, misses, etc. is output and the performance plotted as in previous posts. The results are shown below for the 1830s.
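
The shape of that classify-score-train loop is sketched below. To keep the example self-contained, it uses a trivial stand-in learner that just predicts the most common label it has seen so far; any learner with train and classify methods, including the naive bayesian, would slot into the same loop.

    from collections import Counter

    class MajorityLearner:
        """Trivial stand-in online learner: predicts the most common label seen so far."""
        def __init__(self):
            self.label_counts = Counter()
        def classify(self, text):
            return self.label_counts.most_common(1)[0][0] if self.label_counts else "no"
        def train(self, text, label):
            self.label_counts[label] += 1

    def online_evaluation(learner, trials):
        """trials is a list of (text, actual_label) pairs in chronological order."""
        tally = Counter()
        for text, actual in trials:
            guess = learner.classify(text)      # classify with the learner's current state...
            if guess == "yes" and actual == "yes":
                tally["hits"] += 1
            elif guess == "no" and actual == "yes":
                tally["misses"] += 1
            elif guess == "yes" and actual == "no":
                tally["false positives"] += 1
            else:
                tally["correct negatives"] += 1
            learner.train(text, actual)         # ...and only then train it on that trial
        return tally

    trials = [("stealing a watch", "yes"), ("burglary at night", "no"),
              ("stealing a handkerchief", "yes")]
    print(online_evaluation(MajorityLearner(), trials))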

[Plot: accuracy and precision of the TFIDF-50 learner for each offence category attested 10 or more times in the 1830s]

As can be seen, the performance is pretty stable, considering that individual offences make up proportions ranging from 0.077% of the total (perverting justice, 10/12959) to 42.48% (simple larceny, 5505/12959). The system gets very few false positives for bigamy, and quite a few for shoplifting. We'll look at why this is the case in the next post. It is very accurate for the most frequently attested offence, simple larceny, and relatively inaccurate for the infrequently attested offences of kidnapping (11/12959) and perverting justice (10/12959). The central part of the plot is magnified and shown in the figure below. The performance of the learner varies for similar sorts of crime (e.g., it performs better for indecent assault than assault), something that we will take up next.

[Plot: magnified view of the central region of the preceding figure]


Sunday, June 22, 2008

A Naive Bayesian in the Old Bailey, Part 10

In our last post, we settled on a style of plotting that shows both how accurate our learner is (i.e., does it miss very often?) and how precise it is (i.e., how often does it return a false positive?). We also decided to do experiments with the version of the naive bayesian learner that uses the items with the highest tf-idf as features. Our experiments to date have used the category of simple larceny in the 1830s. This offence is very well-attested, making up about 42.5% of the trials (5505/12959). At this point, we can test the performance of the same learner on offence categories that are less frequent: stealing from master (1718/12959, approx. 13.3%) and burglary (279/12959, approx. 2.2%). We've been using the 15 terms with the highest tf-idf, but we should try some other values for that parameter, too. A graph for the three different offence categories is shown below. The four learners use the top-scoring 15, 30, 50 and 100 items, respectively.

[Plot: accuracy and precision of the TFIDF-15, 30, 50 and 100 learners on larceny, stealing from master, and burglary]

From the graph, it is pretty clear that it is easiest to learn to categorize larceny, which is the best-attested offence we looked at. We can also see that the TFIDF-15 learner does particularly poorly by missing many instances of the less frequent offences. Increasing the number of features the learner can make use of seems to improve performance up to a point. After that, increasing features increases the number of false positives the learner makes. We want the performance of our learner to be relatively robust when learning offence categories that are more or less frequently attested, which means we want the learner with the tightest grouping of results for these test categories (in other words, TFIDF-50).

Note that in this test, we only ran each learner once on each data set, rather than doing tenfold cross-validation. Our experiments with cross-validation suggested that the different versions of the learner were relatively insensitive to the order in which training and testing trials were presented. Since this is exploratory work, we will make the (possibly incorrect) assumption that a single run is probably representative. This will let us do a lot more testing in the same amount of time.

Tags: | | | | |

Saturday, June 21, 2008

A Naive Bayesian in the Old Bailey, Part 9

There are many different ways to measure the performance of our various learning algorithms. The error rate that we've been using so far is defined as the sum of misses and false positives divided by the total number of trials. By this measure, COINFLIP had an average error rate around 50%, and our naive bayesian learner had an error rate around 40% using single-word features, and around 26% using either 2-grams or top-scoring tf-idf features. I thought I might be able to get better performance by using only those 2-grams that included terms with a high tf-idf, but that learner had an error rate around 26%, too. (Recall that we've been using cases of simple larceny in the 1830s for our experiments... the performance will be different for other offences and/or other decades. We'll test some of these soon.)

By using a different measure, we can see that our various learners achieve their results in different ways. From our perspective as researchers, the least interesting category of answers is the correct negatives. Misses are a problem, because they may contain evidence that relates to the argument we're trying to construct. False positives are a problem because, although they are irrelevant, we have to look through them to determine that... in other words, they're a waste of time. A perfect learner would return all and only hits.

If we consider the ratio of misses to hits, we get an idea of how accurate our learner is. As a learner gets better, the ratio of misses to hits approaches 0; as it gets worse, the ratio increases. A disastrous learner might not get any hits at all, so to avoid a division-by-zero error, we'll add one to the denominator. Our accuracy measure is thus misses / (hits + 1). If we consider the ratio of false positives to hits, we can find out how precise our learner is. As it gets better, this ratio goes to zero; as it gets worse, it increases. Our precision measure is false positives / (hits + 1). We can plot both measures on the same graph, with the origin in the lower left-hand corner, as shown below. Since some of the values are large, I've used logarithmic axes. (Also, the results for YES and NO actually lie on the respective zero lines, but I've bumped them over so they can be seen in the plot.)
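
As a sketch, the two measures amount to the following; the counts are made up for illustration and are not taken from the actual runs.

    # misses per hit (accuracy) and false positives per hit (precision), with one
    # added to the denominator so a learner with no hits doesn't divide by zero.
    def accuracy_measure(scores):
        return scores['miss'] / (scores['hit'] + 1.0)

    def precision_measure(scores):
        return scores['false positive'] / (scores['hit'] + 1.0)

    example = {'hit': 400, 'miss': 1200, 'false positive': 90, 'correct negative': 11269}
    print(accuracy_measure(example))    # about 2.99: roughly three misses per hit
    print(precision_measure(example))   # about 0.22: roughly one false positive per five hits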



Looking at the graph, we notice some interesting results. The naive bayesian that uses single words for features gets relatively few false positives, but at the cost of missing an order of magnitude more items than the other two learners. The 2-gram learner outperforms COINFLIP and the tf-idf learner on false positives, but not on misses. The tf-idf learner is the only one that outperforms COINFLIP in terms of both accuracy and precision, so we will do our next round of experiments with the tf-idf learner.

Tags: | | | | |

Wednesday, June 18, 2008

A Naive Bayesian in the Old Bailey, Part 8

In the last post, we got a naive bayesian learner working and used it to categorize some Old Bailey trials from the 1830s as examples of larceny (or not). Our initial version of the learner was easy to implement, but it made the unrealistic assumption that the probabilities of particular words appearing in the text of a trial were independent of one another. That greatly simplified computation at the cost of performance, leaving us with an error rate around 40%. We then revised the learner to use 2-grams as features rather than individual words, which captured some of the dependency between words and brought the average error rate down to around 26%.

An alternative approach is to concentrate on the words in a trial that are most representative of a particular category. Without specifying these words in advance, we can assume that they will be relatively frequent in the document in question, but relatively infrequent in the overall corpus of documents. One common measure for this is known as tf-idf (term frequency-inverse document frequency). Rather than handing all of the words in a given trial to our learner, or all except the stop words, we will only hand off the 15 or 20 with the highest tf-idf. There are many different ways to compute this measure. The version that I used is tfidf = log(tf+1.0) * log(numdocs/df), where tf is the number of times the word occurs in a particular text, numdocs is the total number of documents, and df is the number of documents that the word appears in. The word "cellar," for example, appears in this trial seventeen times, and in 221 other trials in the 1830s. The tfidf for this word in this trial is log(17+1) * log(12959/221) = 11.76781.
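
In Python the calculation is a one-liner; the sketch below uses natural logarithms and reproduces the number above.

    from math import log

    def tfidf(tf, df, numdocs):
        # log(tf + 1) damps the raw term frequency; log(numdocs / df) rewards
        # words that appear in relatively few documents.
        return log(tf + 1.0) * log(float(numdocs) / df)

    print(tfidf(17, 221, 12959))    # 11.767...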

To compute the tf-idf, we first need a list of every word used in any of the trials, along with the number of different trials in which each word appears. We could put this information in a text file, but the file would be huge and very slow to access. Instead, we will store our document frequencies in a SQLite database, using Python commands to store and retrieve the information. The code which creates this database is here. We can then compute the tf-idf scores for each word in a given trial, creating a new directory to store these files. The code to do that is here. Finally, we will want a version of our tenfold cross-validation routine to test the performance of a naive bayesian learner that operates across tf-idf vectors rather than raw words or 2-grams (here). This new learner has similar performance to the 2-gram version, with an average error rate of 25.73% when using the 15 highest-scoring tf-idf terms to categorize cases of larceny in the 1830s. As a bonus, it is remarkably fast. At this point, you're probably wondering what good a machine learner is if one quarter of its judgments are incorrect. We'll get there.
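
For the document-frequency store, something along the following lines would do. The table layout and function names here are assumptions for illustration, not the code linked above; word_counts is taken to be a dictionary mapping each word to the number of trials it appears in.

    import sqlite3

    def build_df_database(word_counts, dbfile='docfreq.db'):
        # one row per word, with the number of different trials it appears in
        conn = sqlite3.connect(dbfile)
        conn.execute('CREATE TABLE IF NOT EXISTS docfreq (word TEXT PRIMARY KEY, df INTEGER)')
        conn.executemany('INSERT OR REPLACE INTO docfreq VALUES (?, ?)', word_counts.items())
        conn.commit()
        conn.close()

    def document_frequency(word, dbfile='docfreq.db'):
        # look up the stored document frequency for a single word
        conn = sqlite3.connect(dbfile)
        row = conn.execute('SELECT df FROM docfreq WHERE word = ?', (word,)).fetchone()
        conn.close()
        return row[0] if row else 0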

Tags: | | | | |

Tuesday, June 17, 2008

A Naive Bayesian in the Old Bailey, Part 7

At last we're in a position to actually train and test some machine learners. The one that we'll start with is called a naive bayesian. It is relatively simple to implement, although it usually doesn't perform nearly as well as fancier and more complicated learners. For our purposes, however, it has some real advantages, which we'll spell out eventually. The version of the naive bayesian learner that I am going to use is the one implemented by Toby Segaran in his book Programming Collective Intelligence. I won't post the code for the learner here, as it is already available online. If you are able to follow this series of posts and are interested in writing machine learning code in Python, Toby's book is a must-have. The only change that I have made is to remove stop words before submitting the trials for training or testing. You can get instructions and code for that from The Programming Historian.
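
Removing stop words just means filtering the token list before handing a trial to the learner. A minimal sketch follows; the short list here is only a stand-in for a real stop word list like the one in The Programming Historian.

    stopwords = ['the', 'of', 'and', 'a', 'to', 'in', 'was', 'i', 'he', 'that']

    def remove_stopwords(text):
        # drop common function words that carry little information about the offence
        words = text.lower().split()
        return ' '.join(w for w in words if w not in stopwords)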

Bayesian learners make use of a theorem proposed by Thomas Bayes and published in 1763, two years after his death (for more on Bayes, see Bellhouse's biography). The theorem states that Pr[H|E] = (Pr[E|H] * Pr[H]) / Pr[E]. Pr[H|E] is the probability that the hypothesis H is true, given some evidence E. Pr[E|H] is the probability that you would see evidence E if the hypothesis H were true. Pr[H] is the probability of the hypothesis and Pr[E] the probability of the evidence. Bayes theorem gives us a way of determining conditional probabilities: given that we know one thing, how likely is something else to be true?

Let's work through a simple example. Suppose bag A contains one black marble and three white ones, and bag B contains two white marbles and two black ones. Someone gives us a black marble but doesn't remember which bag they took it from. Given that we have a black marble, what are the chances that it came from bag A? In this case, Pr[H] is the probability that the marble came from bag A. Since each bag contains the same number of marbles, Pr[H] = 4/8 = 1/2. Pr[E] is the probability that a marble is black, so Pr[E] = (1+2)/8 = 3/8. Pr[E|H] is the probability of drawing a black marble if you choose from bag A, in other words Pr[E|H] = 1/4. So Bayes theorem says that Pr[H|E] = (1/4 * 1/2) / (3/8) = 1/3. Since the marble had to come from one of the two bags, it should have a 2/3 chance of coming from bag B, which we can double-check: Pr[notH|E] = (Pr[E|notH] * Pr[notH]) / Pr[E] = (2/4 * 1/2) / (3/8) = 2/3, as expected. You can learn more about Bayes theorem here.
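
It only takes a few lines to check the arithmetic.

    # Double-checking the marble example numerically.
    pr_h = 4.0 / 8           # the marble came from bag A
    pr_e = 3.0 / 8           # a marble is black (1 black in A + 2 black in B, out of 8)
    pr_e_given_h = 1.0 / 4   # chance of drawing a black marble from bag A

    print(pr_e_given_h * pr_h / pr_e)     # Pr[H|E] = 0.333..., i.e. 1/3
    print((2.0 / 4 * 1.0 / 2) / pr_e)     # Pr[notH|E] = 0.666..., i.e. 2/3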

When applied to the problem of learning, Bayes theorem looks like this: Pr[category|document] = (Pr[document|category] * Pr[category]) / Pr[document]. Since Pr[document] scales the result for every category by the same amount, we can drop it and simply compare Pr[document|category] * Pr[category] across categories. We make the (incorrect) assumption that the probability of each word in the document is independent of the others, so we can set Pr[document|category] equal to Pr[word1|category] * Pr[word2|category] * ... Finally, Pr[category] is simply the proportion of all documents that belong to our category of interest.
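
A bare-bones sketch of the resulting classifier is shown below. Segaran's implementation estimates and smooths the word probabilities more carefully, so treat this only as the shape of the calculation; word_probs and category_probs are assumed to have been estimated from the training data.

    from math import log

    def score(document_words, category, word_probs, category_probs):
        # sum of log probabilities instead of a long product, to avoid underflow;
        # the small floor stands in for proper smoothing of unseen words
        total = log(category_probs[category])
        for word in document_words:
            total += log(word_probs[category].get(word, 1e-6))
        return total

    def classify(document_words, categories, word_probs, category_probs):
        # choose whichever category gives the higher score
        return max(categories, key=lambda c: score(document_words, c, word_probs, category_probs))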

So how well does the naive bayesian learner do? Not very well. In a tenfold cross-validation run testing for cases of simple larceny in the 1830s, it has an average error rate of 39.17%, compared with COINFLIP's average error rate of 49.39%. (The error rate is simply (misses + false positives) / total number of trials.) Part of the problem is our assumption that the probability of any word in a document is independent of the probability of any other word in the same document. We know this isn't strictly true. In the Old Bailey proceedings, for example, you find both "dwelling" and "dwelling house", as well as "victualling house", "sessions house", "station house", "house keeper" and many other forms. To the extent that these and other words tend to co-occur, the word probabilities can't be independent. We can improve the performance of our naive bayesian learner by using pairs of words (i.e., 2-grams) rather than individual words as features. This drops the error rate to 26.23% when categorizing trials for simple larceny in the 1830s. The code that tests the different learners is here. A graph of performance is shown below.
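
Extracting 2-gram features is straightforward; a minimal sketch:

    def two_grams(text):
        # pair each word with its successor so that forms like "dwelling house"
        # or "station house" are kept together as single features
        words = text.lower().split()
        return [words[i] + ' ' + words[i + 1] for i in range(len(words) - 1)]

    print(two_grams("stole from the dwelling house"))
    # ['stole from', 'from the', 'the dwelling', 'dwelling house']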



Tags: | | | | |

Friday, June 13, 2008

A Naive Bayesian in the Old Bailey, Part 6

Now that we have our training and testing samples, we will be able to estimate the error rates of our various machine learners. Some of them won't be very good, especially if they are trained on relatively small or unrepresentative samples. None of them will be perfect, or even approach human performance. So it is usually a good idea to ask if the performance of a given learner is significantly different from chance. Consider three other abstract machines which don't do any learning at all.

YES is a very simple machine. When given an item and asked whether or not it is an instance of a particular category, YES says "yes". That's it. Suppose we have 100 test items and all of them are instances of our category, say 100 examples of burglary. We ask YES about each of them and it 'decides' that each is a burglary. YES makes no errors at all on this test sample! If half of the test items are not burglaries, however, YES's error rate climbs to 50%.

NO is also a very simple machine, responding "no" whenever tested. If we give it 100 examples of burglaries, it will fail to recognize every single one of them, with an error rate of 100%. The fewer burglaries our test sample contains, the better NO does.

COINFLIP is more sophisticated than YES or NO. Every time we ask COINFLIP to make a decision, it has a 50% chance of responding "yes" and a 50% chance of responding "no". Given a sample with 100 examples of burglaries, COINFLIP gets it wrong about half the time. Given a sample with no burglaries in it, COINFLIP will also have an error rate around 50%.

With these three simple machines, we can be clearer about what it means to be right or wrong, distinguishing four categories:
  • Hit. If the machine says "yes" and the right answer is "yes", we say that it has scored a hit. This is one kind of correct answer. Both YES and COINFLIP are capable of scoring hits, but NO never is, because it can never say "yes" to anything.
  • False Positive. If the machine says "yes" but the answer is really "no", we say that it has responded with a false positive, which is one kind of incorrect answer. YES and COINFLIP can reply with false positives, but NO cannot.
  • Miss. If the machine says "no" but the correct answer was "yes", we say that it missed. NO and COINFLIP can miss, but YES cannot, because it never says "no".
  • Correct Negative. This happens when the machine says "no" and the correct answer was "no". NO and COINFLIP can reply with correct negatives, but YES cannot.
We expect our learners to produce answers in each of the four categories. A machine that says "yes" too readily will score plenty of hits, but it will also return a lot of false positives; this can be good if you are looking for a needle in a haystack, but will overwhelm you if your category is well-attested. A machine that says "no" too readily will return plenty of correct negatives, but it will also miss things; these kinds of machines tend to be more useful when you would never have time to go through all of your items by hand. Most machine learners have parameters that allow you to tune their performance between these extremes.
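
For concreteness, here is a small sketch of the three baseline machines and the scoring of a single decision.

    import random

    def yes_machine(item):
        return True                     # always says "yes", whatever the item

    def no_machine(item):
        return False                    # always says "no"

    def coinflip_machine(item):
        return random.random() < 0.5    # says "yes" about half the time

    def score_answer(guess, truth):
        # sort one decision into the four categories described above
        if guess and truth:
            return 'hit'
        elif guess and not truth:
            return 'false positive'
        elif not guess and truth:
            return 'miss'
        else:
            return 'correct negative'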

Tags: | | | | |