Wednesday, August 20, 2008

Traces of Use

When he figured that April was the cruelest month, I think T. S. Eliot was off by four. I find that the early summer stretches into an endless vista of exciting possibilities for new research and teaching. I make far too many commitments, all of which come back to haunt me in late August. Other than dropping in to do light maintenance, for example, I haven't had time recently to write much new material for the Programming Historian. The last time that I did, however, I noticed that the visitor logs tell an interesting story.

To date, the front page has received around 12 thousand hits as people arrive at the site and decide what to do next. At that point, most of them leave: they may have ended up there by accident, or they may bookmark the site to look at later. The next two sections are prefatory. The first (around 4 thousand visits) suggests why you might want to learn how to program. The second (almost 5 thousand visits) tells you how to install the software you need to get started. My interpretation is that about a fifth of our visitors are already convinced they want to learn how to program, which I think is a good sign. The actual programming starts in the next section (2 thousand visits) and goes from there, with the number of visitors to subsequent sections slowly dropping to about a thousand each. These numbers could be interpreted in various ways, but to me they suggest that (1) historians and other humanists want to learn how to program, (2) good intentions only get you so far, and (3) if you do stick with it, it gets harder gradually.
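For what it's worth, that funnel can be worked out in a few lines of Python. The numbers are the rounded figures quoted above; the section labels are my own shorthand.

```python
# Rounded visit counts from the paragraph above, front page through
# the first hands-on programming section.
visits = [
    ("front page", 12000),
    ("why learn to program?", 4000),
    ("installing the software", 5000),
    ("first programming section", 2000),
]

front = visits[0][1]
for section, count in visits:
    share = count / front  # fraction of front-page traffic that reaches this section
    print(f"{section}: {count} visits ({share:.0%} of front-page traffic)")
```

The last line of output shows the drop-off: only about a sixth of front-page visitors reach the first real programming lesson.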

These are pretty crude metrics, though more informative than the ones I'm getting from, say, the sales figures for my award-winning-but-otherwise-neglected monograph (buy a copy today!). My friends who work in psycholinguistics have much more sophisticated ways of determining how people read and understand text, with devices that track the subject's gaze and estimate the moment-by-moment contents of their short-term memory. I want people to get something out of the Programming Historian, but I don't need that level of detail about what they're getting.

In The Social Life of Information, Brown and Duguid have an anecdote about a historian who goes through batches of eighteenth-century letters rapidly by sniffing bundles of them. When asked what he is doing, he explains that letters written during a cholera outbreak were disinfected with vinegar. "By sniffing for the faint traces of vinegar that survived 250 years and noting the date and source of the letters, he was able to chart the progress of cholera outbreaks." Brown and Duguid go on to note that "Digitization could have distilled out the text of those letters. It would, though, have left behind that other interesting distillate, vinegar."

Probably, but not necessarily. Digitization simply refers to the explicit digital representation of something that can be measured. We are content at the moment with devices that take pictures of documents, and those devices have been steadily improving. We wouldn't be as content with the scanning quality of 2002, when The Social Life of Information was published, and we'd, like, totally hate the scanning quality of 1982 or 1962 ... just ask my students when they have to work with microfilm. That said, high-resolution infrared spectroscopy makes it possible to build chemical sniffers that outperform human noses. It also makes it possible to go through an archive and digitize the smells of every document.

Saying that we can digitize any trace that we can discover and measure isn't the same thing as saying we can discover and measure any trace that we might need at the moment, episodes of CSI notwithstanding. The material world is almost infinitely informative about the past, but the traces that are preserved have nothing to do with our interests and intents. And one shouldn't draw too fine a line between the analog and the digital, because digital representations are always stored on real-world analog devices, something Matt Kirschenbaum explores in his new book Mechanisms.


Thursday, August 07, 2008

Arms Races

[Cross-posted to Cliopatria and Digital History Hacks]

Like many people who blog at Blogger, I was recently notified by e-mail that my blog had been identified by their automated classifiers "as a potential spam blog." In order to prove that this was not the case, I had to log in to one of their servers and request that my blog be reviewed by a human being. The e-mail went on to say "Automatic spam detection is inherently fuzzy, and occasionally a blog like yours is flagged incorrectly. We sincerely apologize for this error." The author of the e-mail knew, of course, that if my blog were sending spam then his or her e-mail would fall on deaf ears (as it were)... you don't have to worry about bots' feelings. The politeness was intended for me, a hapless human caught in the crossfire in the war of intelligent machines.

That same week, a lot of my e-mails were also getting bounced. Since I have my blog address in my .sig file, I'm guessing that may have something to do with it. Alternatively, my e-mail address may have been temporarily blocked as the result of a surge in spam being sent from GMail servers. This to-and-fro, attack against counter-attack, Spy vs. Spy kind of thing can be irritating for the collaterally damaged, but it is good news for digital historians, paradoxical as that may seem.

One of the side effects of the war on spam has been a lot of sophisticated research on automated classifiers that use Bayesian or other techniques to categorize natural language documents. Historians can use these algorithms to make their own online archival research much more productive, as I argued in a series of posts this summer.
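To give a flavor of what those classifiers do, here is a toy naive Bayes sketch — my own illustration, not any particular spam filter or the code from those earlier posts. It counts word frequencies per category and picks the category with the highest log-posterior score; the training examples are invented.

```python
from collections import Counter
import math

def train(labeled_docs):
    """Tally word frequencies per category from (text, label) pairs."""
    counts = {}         # label -> Counter of word frequencies
    totals = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        totals[label] += 1
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts, totals

def classify(text, counts, totals):
    """Return the label with the highest log-posterior score."""
    words = text.lower().split()
    vocab = {w for c in counts.values() for w in c}
    n_docs = sum(totals.values())
    best_label, best_score = None, float("-inf")
    for label, word_counts in counts.items():
        # log prior plus summed log likelihoods, with add-one smoothing
        score = math.log(totals[label] / n_docs)
        denom = sum(word_counts.values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set: two (hypothetical) categories of archival documents.
training = [
    ("tariffs merchants shipping trade", "economic"),
    ("soldiers battle war battle", "military"),
]
counts, totals = train(training)
print(classify("the soldiers joined the battle", counts, totals))  # military
```

Scaled up to thousands of labeled documents, the same idea lets a historian triage an online archive into "relevant" and "not relevant" piles before reading a single page.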

In fact, a closely related arms race is being fought at another level, one that also has important implications for the digital humanities. The optical character recognition (OCR) software that is used to digitize paper books and documents is also being used by spammers to try to circumvent software intended to block them. This, in turn, is having a positive effect on the development of OCR algorithms, leading to higher-quality digital repositories as a collateral benefit. Here's how.
  • Computer scientists create the CAPTCHA, a "Completely Automated Public Turing test to tell Computers and Humans Apart." In essence, it shows a wonky image of a short text on the screen, and the (presumably human) user has to read it and type in the characters. If they match, the system assumes a real person is interacting with it.
  • Google releases the Tesseract OCR engine that they use for Google Books as open source. On the plus side, a whole community of programmers can now improve Tesseract OCR. On the minus side, a whole community of spammers can put it to work cracking CAPTCHAs.
  • In the meantime, a group of computer scientists comes up with a brilliant idea, the reCAPTCHA. Every day, tens of millions of people are reading wonky images of short character strings and retyping them. Why not use all of these infinitesimal units of labor to do something useful? The reCAPTCHA system uses words that OCR engines failed to recognize as its challenges. When you respond to a reCAPTCHA challenge, you're helping to improve the quality of digitized books.
  • The guys with white hats are also using OCR to crack CAPTCHAs, with the aim of creating stronger challenges. One side effect is that the OCR gets better at recognizing wonky text, and thus better for creating digital books.
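The reCAPTCHA scheme in the third bullet can be sketched in a few lines of Python. This is my own simplification of the protocol, not reCAPTCHA's actual implementation: pair a word the system already knows with one that OCR failed on, verify the user against the known word, and treat their reading of the unknown word as one vote toward a transcription.

```python
import random

class Recaptcha:
    """Toy sketch of the reCAPTCHA protocol (a simplification)."""

    def __init__(self, known_words, unknown_words):
        self.known = list(known_words)      # words with verified transcriptions
        self.unknown = list(unknown_words)  # words OCR could not read
        self.votes = {w: [] for w in self.unknown}

    def challenge(self):
        """Serve one control word and one unknown word."""
        return random.choice(self.known), random.choice(self.unknown)

    def respond(self, known_word, unknown_word, answer_known, answer_unknown):
        # The user passes only if they typed the control word correctly;
        # their reading of the unknown word is then recorded as a vote.
        if answer_known != known_word:
            return False
        self.votes[unknown_word].append(answer_unknown)
        return True

    def transcription(self, unknown_word, threshold=2):
        """Accept a reading once enough independent users agree on it."""
        for guess in set(self.votes[unknown_word]):
            if self.votes[unknown_word].count(guess) >= threshold:
                return guess
        return None
```

Each solved challenge does double duty: it keeps a bot out, and it converts one more unreadable word in a digitized book into text.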
