Sunday, January 20, 2008

Relevance Feedback

Suppose Albert, Betty and Chris are historians of food, technology and Indonesia, respectively. It's not hard to imagine a scenario where each might sit down in front of Google and type "java" into the search box. One of the key problems of designing a search engine is trying to find a way to order the results so that the highest ranked hits will be relevant to the most users. In this case, let's assume that Google isn't tracking any of the three (i.e., they aren't logged in to Gmail or other services and they aren't using their own computers). I just tried this search while logged in to Google and the top twelve results were relevant to the programming language, followed by one hit for the Indonesian island and then thirty-seven more for the programming language. I stopped counting. I love coffee, but I don't read about it or buy it online, so it is possible that my search history helps Google know that I'm probably looking for information about the programming language. It's also possible that most people who use Google are looking for information about the programming language.

Google's default assumption in this case is good news for Betty, and not such good news for Albert or Chris. Each of them could go on to refine their search, of course, by adding keywords ("java +coffee"), subtracting them ("java -programming"), or both. But the fact remains that Betty will find what she is looking for immediately, while the other two won't without more digging. It is easy to see how repeated experiences like these might shape a person's sense of the web, leading them to see it as a place of scarcity or abundance.

Without knowing more about what a particular searcher is after, it is very difficult to do better than to match the distribution of result relevance to something else that can be measured easily. That may be a measurement of the importance or centrality of sets of documents, or a survey of what users are looking for when they enter popular keywords, or any number of other measures, singly or in combination. Search engine companies can also measure the click-through rate for particular links. If most people click on one of the results on the first page of hits and then don't repeat or modify their search, the company can infer that the result was probably relevant to the searcher's needs.
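
A toy example may help make that inference concrete. The sketch below takes an invented log of searches and clicks and, for each result, computes the fraction of clicks that were not followed by a repeated or modified query; a higher fraction suggests the result satisfied the searcher. The log entries and URLs are made up for illustration.

# A toy sketch of the click-through inference described above: for each result,
# what fraction of clicks were *not* followed by a repeated or modified search?
# The log entries are invented for illustration.
from collections import defaultdict

# Each record: (query, clicked_url, searcher_reformulated_afterwards)
log = [
    ("java", "java.sun.com", False),
    ("java", "java.sun.com", False),
    ("java", "en.wikipedia.org/wiki/Java", True),
    ("java", "java.sun.com", True),
    ("java", "en.wikipedia.org/wiki/Java_(island)", False),
]

clicks = defaultdict(int)
satisfied = defaultdict(int)
for query, url, reformulated in log:
    clicks[url] += 1
    if not reformulated:
        satisfied[url] += 1

for url in clicks:
    print(url, round(satisfied[url] / clicks[url], 2))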

Machine learning methods are often categorized as "supervised" or "unsupervised." In the former case, the system gets feedback telling it what is, or even better, what is not, a correct answer. Unsupervised methods don't receive any feedback, which usually makes their task much more difficult. If we cast search engine relevance in these terms, we can see that the system faces a task which is only partially supervised at best.

In information retrieval systems that were created before the web, users typically learned to construct elaborate queries and to refine them based on the results that they received. These systems often included a way for the user to provide relevance feedback. In the context of the web, queries are typically only a word or two long, and most search engines don't include a mechanism for the searcher to provide direct relevance feedback. This may be good enough for web searchers taken as a group (it may even be optimal), but it imposes a cost on individual researchers. Researchers need to be able to find obscure sources, and the best way to do this is to pair them with a system that can learn from relevance feedback. Digital humanists need tools that go beyond the single-box search. And we're probably going to have to write them ourselves.
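
For readers who want to see what relevance feedback looks like in code, the classic approach in pre-web information retrieval was the Rocchio algorithm: nudge the query vector toward documents the searcher marks relevant and away from those marked irrelevant. Here is a minimal sketch using plain word-count vectors; the sample documents and the alpha, beta and gamma weights are placeholders rather than tuned values.

# A minimal sketch of Rocchio relevance feedback using word-count vectors.
# The documents, query and weights are invented for illustration.
from collections import Counter

def vectorize(text):
    """Turn a text into a bag-of-words vector."""
    return Counter(text.lower().split())

def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from irrelevant ones."""
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in irrelevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(irrelevant)
    # Negative weights are usually dropped.
    return Counter({t: w for t, w in new_query.items() if w > 0})

query = vectorize("java")
relevant = [vectorize("java coffee arabica plantation"),
            vectorize("dutch east indies coffee trade java")]
irrelevant = [vectorize("java programming language virtual machine")]

print(rocchio(query, relevant, irrelevant).most_common(5))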


Monday, January 14, 2008

The Programming Historian

My colleague Alan MacEachern and I have decided to write a book to teach practicing historians how to use programming to augment their ability to do research online. The Programming Historian will be provided as an open access work via the website of NiCHE: Network in Canadian History & Environment. We'll announce the details soon. In the meantime, here are a few things that will make this work different from existing books about programming...

1. We think that you should be able to put what you learn to work in your research practice immediately. Many beginning programmers lose patience because they can't see why they're learning what they're learning.

2. Digital history requires working with sources on the web. This means that you're going to be spending most of your research time working in a browser, so you should be able to use your programming skills in the browser.

3. Our examples will build on real historical sources online and on open source projects in the digital humanities. In particular, the programs that you create will be tightly integrated with Zotero.

4. We'll draw on a wide range of techniques from information retrieval; text, data and web mining; statistical natural language processing; machine learning; and other disciplines.

If you'd like to contact us with questions or comments, there is contact information on our faculty web pages: Turkel & MacEachern.


Tuesday, January 08, 2008

Results When and Where You Need Them

In my previous post I complained about a taken-for-granted model that carves the research process into discrete stages of information gathering, analysis, writing and publication. As I noted, I don't think that this model really makes sense anymore. I've been trying to figure out where it came from, and more to the point, why it persists.

We all have preferred ways of coming up with explanations, and one of my favorites is to start with an unshakeable belief in the second law of thermodynamics and go from there. In the wake of any event, there are a range of material and documentary sources that can be used to make inferences about what happened. Time continues, however. Memories are reworked, documents are lost, physical evidence decays and is disrupted. Contexts for understanding various pasts change, too, of course. We might even say that "all is flux." Against this inexorable dissolution, we've tried to create little islands of stasis. These include libraries, museums and archives, and also brass plaques, time capsules, heirloom species, national parks, and mathematical laws.

In Into the Cool, Schneider and Sagan summarize the second law by saying that "nature abhors a gradient." To the extent that we don't abhor gradients, we have to pay to maintain them. For example, there are information and transaction costs associated with learning anything. (In this case, the gradient that you are trying to maintain is your own wit and wisdom. If you're reading this, you may find that easier to maintain than your waistline, but they're all losing battles in the long run.) In the past, these costs were highest when it came to moving historians to distant documents and keeping them near those documents temporarily. When I did archival work and fieldwork for my dissertation, I was acutely aware of the cost of being 3,000 miles from home. I had the sense that it really mattered which box I requested next at the archive, or which place I decided to visit in the field. Many researchers describe having had similar experiences... it's part of the fun, the frisson, of archival work. But the high cost of doing research in the material world forces research time into clumps.

Most academic researchers also have to teach to support themselves, and this introduces another kind of temporal clumping. Research trips are rarely taken during the school year, and writing is often deferred, too. I'm trying hard to spread my own research and writing throughout the year, but I'm aware that I went for 25 days without posting to my blog last December, and that I've written five posts in the last 12 days. Tomorrow I start teaching again, attending job talks, and so on.

I'm not going to change the costs associated with working in the material world, of course. I'm not going to change the university calendar into a year-round, part-time engagement, either. But to the extent that the digital world changes the landscape of transaction and information costs that we face, it will make a big difference in our shared research model.

As I see it, many of the programs that we are currently using impede the unification of the research process. At a minimum, most historians probably rely on a word processor and web browser. They may also use a spreadsheet, bibliographic database and more specialized programs like an RSS feed reader, relational database, statistical package, GIS, or concordancer. Each of these programs is designed to be "sovereign," to use Alan Cooper's term, to be "the only [program] on the screen, monopolizing the user's attention for long periods of time." The move to Web 2.0 has put a lot of functionality in the browser, and programs like Zotero are clearly a step in the right (only) direction. But the fact remains that most of our own research materials are locked into little silos. Moving from one of these silos to another imposes its own granularity on our activities.

How could this be different? Think of your Zotero bibliography as the core of your research process. Every item in it is there because it is relevant to your work. Suppose you keep your notes and drafts in Zotero, too. Then for the purposes of digital history, a good statistical description of your Zotero database is the best and most up-to-the-minute description of your research process. That description will be more accurate to the extent that you can incorporate other streams of information into it, like the feeds that you read, the books that you purchase, and the research-related web searches that you do. I think that the development of Zotero in the near future will allow more and more of this kind of incorporation, and the fact that the software is open source and provides an API bodes well for using it as a platform for mining. The key point that I want to emphasize, however, is that measurements of your Zotero bibliography will be most useful to the extent that they are fed back into your research in a useful way. Suppose you do a quick analysis of a text that you are in the process of reading. It is quite simple to provide the results of that analysis both as information that you can read, and as a vector that can be used to refine automatic searching or spidering for related material.
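
To make that last point a little more concrete, here is a minimal sketch of turning a text you are reading into a term vector and using cosine similarity to rank candidate pages that a spider has fetched. The texts and the tiny stop-word list are placeholders; a real implementation would work from your Zotero items and use something like TF-IDF weighting.

# A minimal sketch: turn a text into a term vector, then use it to rank
# candidate documents by cosine similarity. All texts are placeholders.
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # tiny placeholder list

def term_vector(text):
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(words)

def cosine(v1, v2):
    shared = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in shared)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

reading = term_vector("history of the coffee trade in the dutch east indies")
candidates = {
    "page1": term_vector("coffee cultivation and trade in colonial java"),
    "page2": term_vector("installing the java virtual machine"),
}

# Rank the candidate pages by similarity to what you are currently reading.
scores = {name: cosine(reading, vec) for name, vec in candidates.items()}
for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(value, 3))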


Saturday, January 05, 2008

All is Flux

If you wanted a motto for digital history, it's hard to imagine finding anything better than the one that Heraclitus is supposed to have come up with around 500 BCE, when he said something to the effect that 'all is flux' or 'everything flows' or 'you can't step into the same river twice'.

I think that many historians have a research model which looks a bit like this:
  1. Formulate question
  2. Do research
    1. Collect a bunch of sources
    2. Decide which look most promising and skim through those
    3. Read the most relevant ones carefully
    4. Take good notes
  3. Write
  4. Publish

We all agree that the stages of the research process are indistinct and blend into one another. We all agree that there is a lot of movement to-and-fro and back-and-forth, and time for visions and revisions. Nevertheless, this research model--what the heck, let's call it Parmenidean--is widely enough understood that many professors ask their graduate students questions like "Have you done your research yet?" or "When are you going to start writing?" The students, in turn, reply with answers that may please or displease their advisors, but which are understood to be felicitous in the pragmatic sense.

Digital historians, on the other hand, have to be thoroughgoing Heracliteans and reject questions like "Have you done your research yet?" The only sensible way to do research online is to be doing everything all at once all the time. The research model looks like this:
  • Until your interpretation stabilizes...
    • You keep refining your ensemble of questions
    • Your spiders and feeds provide a constant stream of potential sources
    • Unsupervised learning methods reveal clusters which help to direct your attention
    • Adaptive filters track your interests as they fluctuate
    • You create or contribute to open source software as needed
    • You write/publish incrementally in an open access venue
    • Your research process is subject to continual peer review
    • Your reputation develops

Do we have what we need to fully implement this strategy? A lot of the pieces are already in place, including massive textual databases, search engines with APIs, XML, RSS feeds and feed readers, high-level programming languages, and tools for online scholarship like Zotero. The combined literature of statistical natural language processing, text and data mining, machine learning, and information retrieval provide a cornucopia of useful techniques. If you know how to program you're already most of the way there; if not, now is as good a time as any to begin learning how.
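
As a small illustration of the "feeds plus adaptive filters" items in the list above, the sketch below uses the feedparser library to pull entries from an RSS feed and rank them against a simple keyword interest profile. The feed URL and the keyword weights are placeholders, and a genuinely adaptive filter would update those weights as you read.

# A minimal sketch of filtering an RSS feed against a keyword interest profile.
# The feed URL and keywords are placeholders; requires the feedparser library.
import feedparser

interests = {"java": 1.0, "coffee": 2.0, "indonesia": 1.5, "archive": 1.0}

def score(entry):
    """Score a feed entry by summing the weights of interest terms it contains."""
    text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
    return sum(weight for term, weight in interests.items() if term in text)

feed = feedparser.parse("http://example.com/history-feed.rss")  # placeholder URL
ranked = sorted(feed.entries, key=score, reverse=True)

for entry in ranked[:5]:
    print(round(score(entry), 2), entry.get("title", "(untitled)"))

# When you actually read an item, you might boost the weights of its terms,
# so that the filter adapts to your interests over time.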


Wednesday, January 02, 2008

What's the Opposite of Big History?

A couple of times in this blog, I've mentioned big history, an ambitious attempt to narrate history from the big bang to the present. Like microhistory, the Annales school, environmental history, and a few other thematic approaches to the discipline, one of the things that big history teaches us is that we can learn something different by judiciously manipulating the scale of our inquiry.

By providing us with access to completely new kinds of sources, digital history opens up some additional possibilities for manipulating scale. Consider, for example, the cached data provided by Google and other search engines. When you do a search you have the option of following the provided link, or of seeing what the page looked like when Google's spiders last visited. The date and time at which the copy was cached are also provided, and it is straightforward to write a program to retrieve the current page and the cached copy and compare them to see what has changed. As a test, I did a Google search for "digital history" on 2 Jan 2008 at 15:05 GMT and recorded the times that the cached copy had been created for each page on the first page of hits. Sorted by duration, the results were: 3 days 8 hours 14 minutes, 3d 10h 24m, 3d 15h 14m, 3d 15h 36m, 4d 14h 50m, 5d 3h 12m, 5d 5h 20m, 5d 8h 24m, and 257d 15h 52m.
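
Here is a rough sketch of the comparison program described above, using Python's standard library to fetch a live page and Google's cached copy of it and print a diff. The cache URL pattern is an assumption that may change, Google may refuse automated requests, and in practice you would want to strip the HTML and extract the cache timestamp before comparing.

# A rough sketch: fetch a live page and Google's cached copy of it, then diff them.
# The cache URL pattern is an assumption and may change; Google may also refuse
# automated requests, so treat this as an illustration rather than a robust tool.
import difflib
import urllib.parse
import urllib.request

def fetch(url):
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

page_url = "http://example.com/some-page.html"  # placeholder
cache_url = "http://www.google.com/search?q=" + urllib.parse.quote("cache:" + page_url)

live = fetch(page_url).splitlines()
cached = fetch(cache_url).splitlines()

# Show only the lines that differ between the cached copy and the live page.
for line in difflib.unified_diff(cached, live, fromfile="cached", tofile="live", lineterm=""):
    print(line)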

Now suppose you wanted to write the history of a very brief interval, say a few hours, minutes or even seconds. In the past, this kind of history--I'm not sure what to call it--would only have been possible for an event like 9/11, the JFK assassination or D-Day. But with access to Google's cache data and some sophisticated data mining tools, it becomes possible to imagine creating rich snapshots of web activity over very short intervals. And to the extent that web activity tracks real world activity and can be used to make inferences about it, it becomes possible to imagine writing the history of one second on earth, or one millisecond, or one microsecond.


Tuesday, January 01, 2008

The Search Comes First

In December, I had a chance to visit humanists at a couple of universities in the Boston area and talk about digital history. One kind of question that came up repeatedly was foundational: What do humanists really need to know in order to be more effective online researchers? What should they learn first? What constitutes a baseline literacy? How can digital humanists be introduced into existing departments, or the techniques of digital humanities be added to existing curricula?

Since I started this blog two years ago and began teaching digital history classes, I've had the chance to revisit these questions a number of times. My original answer was that it all begins with search, and I think that still holds. For me, the essence of digital history is the shift to what Roy Rosenzweig called a "culture of abundance." The internet is unimaginably large and growing exponentially. Individual researchers, on the other hand, have a sharply bounded capacity to absorb or make sense of new material.

I think that a lot of historians are resistant to the idea of processing documents computationally because they think of it as a challenge to, or substitute for, reading. Instead, computation should be seen as a way to augment human abilities. We still need human beings to read and interpret sources, and we must still train our students in traditional philological techniques. There's no getting around the fact, however, that the way we find sources has drastically changed in the last ten or fifteen years.

According to Search Engine Watch, as of this past summer search engines worldwide were handling about 61 billion searches per month. More than half of these were handled by Google, making its ranking algorithms the most pervasive source of bias in the history of research. It's clear that humanists need to understand how search engines work, and need to be able to parameterize their searches to get the best results. Your ability to do a virtuoso close reading is irrelevant if you can't find the sources to read in the first place. Humanists who wish to place their own material online also need to understand search engine technology, because it is the deciding factor in whether a work can be found, read and cited.

In my conversations last month, the follow-up question was usually whether or not historians and other humanists will need to be able to program computers. I'm not sure about the answer to that. I'm certain that some of them will. The discipline of history is in for some interesting times, as interpretations backed by intensive research in a few archives will be confronted with those backed by machine learning or text mining of massive datasets. My hope is that we'll find a rapprochement... but then I'm an optimist.
