Tuesday, July 31, 2007

Putting It in Your Own Words

When we teach history students how to take notes for research, we usually tell them to take down direct quotes sparingly, and to put things in their own words instead. Many university writing labs provide training in the art of paraphrasing. One concern is that direct quotes lend themselves to witting or unwitting plagiarism, especially if the paper is being written the night before it's due.

I've always found paraphrasing to be an unsatisfactory exercise because it is in direct tension with close reading. You read the original passage carefully, set it to one side, and then write out the ideas in your own words. At that point you're supposed to re-read the original passage and make sure that you captured the essence. Of course you didn't. As Mark Twain once said, "The difference between the almost-right word & the right word is really a large matter -- it's the difference between the lightning-bug and the lightning." [*] If a student came to me with this example, I'd tell them that there are times when you really should quote rather than paraphrase.

In fact, when I'm taking notes, I usually write down a lot of direct quotes. When I go back to them later, I find that the author's exact words serve as much better reminders of his or her work than paraphrases do. And when I write my first draft of anything, I usually have a lot more quotes than I'm going to want to have in the final version. I know that I'm going to re-read and re-write each passage dozens of times, and that all but the best quotes will be squeezed out in the process.

The problem of putting something in your own words is paralleled in machine learning by a problem known as overfitting. Suppose you work on the production line of a company that makes delicious little chocolates with multi-colored candy shells [Cdn|US]. Even though all of the candies taste the same, your company has come to the conclusion that people pay attention to the color ... they have marketing campaigns based on a preference for eating the red ones last, or the ability to customize the color, or whatever. Your job is to look at the candies as they go by and sort them by color, tossing out any that don't match one of the approved shades. (Sometimes the coloring machine malfunctions and you end up with colors that are more appropriate to your competitor.) Now any hacker in this situation is going to build a robot, so you do.

As the candies come down the line, the robot tries to sort them and you provide feedback. If you don't provide enough training, the robot might decide that all of the candies are either blue or red. It is right some of the time, but not enough. That is known as underfitting.

If you provide it with too much training on a limited set of examples, it might be correct 100 percent of the time for those examples, but at the cost of memorizing too much detail. Suppose you see five candies in a row, and categorize each as blue. To simplify quite a bit, things that we call "blue" have a wavelength around 475 nanometers. Your robot, however, comes up with five very specific rules: IF WAVELENGTH = 460.83429nm THEN COLOR = blue; IF WAVELENGTH = 483.00089nm THEN COLOR = blue; and so on. Once you turn it loose on a new batch of candies, it is going to start malfunctioning, because it learned too much detail about your original set of examples. It doesn't know what to do if the wavelength is 460.84000nm. This is the problem of overfitting.

Now there are a lot of sophisticated methods for avoiding these problems if you are forced to model a limited data set. But the best way to avoid them is to use a lot of training data.
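
For the programmers in the audience, here is a toy Python sketch of the three robots. Everything in it is an assumption made for illustration: the training wavelengths (apart from the two quoted above), the 550nm cut-off, the 40nm tolerance and the function names. It is not a real color-sorting system, just the candy story written out as code.

```python
# Hand-labelled training examples: (wavelength in nm, color).
# The values are illustrative, not measurements.
TRAINING = [
    (460.83429, "blue"),
    (483.00089, "blue"),
    (471.5, "blue"),
    (655.2, "red"),
    (640.9, "red"),
]


def underfit_classify(wavelength):
    """Too little training: every candy is either blue or red."""
    return "blue" if wavelength < 550 else "red"


def overfit_classify(wavelength):
    """Too much detail: one rule per training example, exact matches only."""
    rules = dict(TRAINING)
    return rules.get(wavelength, "unknown")


def nearest_mean_classify(wavelength, tolerance=40.0):
    """A middle road: compare against the mean wavelength of each color
    and reject anything that is too far from every approved shade."""
    totals = {}
    for w, color in TRAINING:
        total, count = totals.get(color, (0.0, 0))
        totals[color] = (total + w, count + 1)
    means = {color: total / count for color, (total, count) in totals.items()}
    best = min(means, key=lambda color: abs(means[color] - wavelength))
    if abs(means[best] - wavelength) > tolerance:
        return "reject"
    return best
```

The memorizing robot has nothing to say about 460.84nm, the two-bucket robot happily files a green candy (around 530nm) under blue, and the averaging robot handles both cases. With more training data the averages and the tolerance would become correspondingly more trustworthy.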

Which brings us back to putting things in your own words. The problem that students encounter with note-taking doesn't have as much to do with quoting vs. paraphrasing as you might think. The problem has to do with not looking at enough sources. If you only consult a handful of sources, then direct quoting might lead you to plagiarism, which would be a case of overfitting. If you paraphrase a handful of sources instead, you may avoid plagiarism but your essay isn't going to be any more nuanced. That is going to lead to underfitting. Either way, a model of a small number of sources is bound to be a bad predictor for the sources that you didn't consult. The only way out is to read more... a lot more. (See my earlier post on "The Difference That Makes a Difference.")


Saturday, July 21, 2007

Import-Export Specialists

James Clifford once said in an interview that he "often function[s] as a kind of import-export specialist between the disciplines" [On the Edges of Anthropology, 55]. I think it's a great description of a particular kind of academic work: finding an idea, tool or technique that is well understood in one context and putting it to use in another. It has particular relevance for the practice of public history.

While thinking about ways of enriching historical practice with digital sources and computation, I've had a lot of occasion to draw on programming, machine learning, and statistical linguistics. In part, these choices reflect my own interests and training before I became a historian. More than that, they're pretty obvious places to look for inspiration. In many ways, digital history is still very textual. It highlights the act of reading; most tools are designed to augment reading or serve as surrogates for it; and outputs are almost always textual in turn. This is as it should be. Most historians (myself included) love to read. Academic history will remain a primarily textual discipline for the foreseeable future.

As I've begun to explore the idea of creating devices and environments that convey a more ambient sense of the past, however, I've had to look a bit further afield for my imports, finding many opportunities to learn from people involved in interaction design, robotics, performance and electronic music. These scholars are often disciplinary import-export specialists in their own right. If you have some time this summer to spend hacking history appliances, here are some good starting points.

Interaction design. Try Bill Moggridge's Designing Interactions and Dan Saffer's Designing for Interaction.

Robotics. The behavior-based approach of Rodney Brooks and his colleagues starts with simple but fully functional creatures interacting with the real world. More complicated systems are built by adding layers of control which subsume lower-level functionality. This strategy lends itself to designing robust interactions between people and history appliances, as I will show in detail in a future post. The related Junkbots, Bugbots and Bots on Wheels is a good source of ideas and techniques.

I also really enjoy reading the blog of Ashish Derhgawen, who comes up with some very creative hacks on a fairly limited budget. This summer he's already figured out a way to use his cellphone as a remote door opener, written a program that can play the classic video game Pong by watching the screen with a webcam, and given one of his robots the ability to respond to claps and whistles.

Performance. The best book that I've found so far for hooking up sensors and actuators to your computer is Tom Igoe and Dan O'Sullivan's Physical Computing. Both of the authors are associated with NYU's Interactive Telecommunications Program, and their focus on live events makes their work particularly useful for people who want to design experiences. The fact that they usually teach artists rather than engineers makes for a very readable work. Igoe's physical computing website is also a great resource.

Electronica. I like listening to electronic music, but hadn't learned anything about it until quite recently. What I've read about its history suggests that it is quite common for electronic musicians to spend a fair amount of their time building new instruments and exploring their creative possibilities. The Cycling '74 website has an interesting collection of resources, including videos, interviews and tutorials. The Create Digital Music webzine is also full of useful stuff. For me, electronica is Ultima Thule: so far out there that I have a hard time finding my most trusted landmarks (i.e., good books on the subject). Pinch and Trocco's Analog Days is an exception.


Thursday, July 19, 2007

History Appliances: Spöka

On a recent trip to Ikea I came across this awesome little dude. They're selling Spöka as "children's lighting," but it was pretty clear to me that it was one hack short of a history appliance. It has a rechargeable battery, so that you can use it without it being plugged in. If you slide off the rubber skin, there is a light-bulb-shaped plastic housing inside.

The designer thoughtfully created a case which can be opened into three parts and reassembled with nothing more than a small screwdriver.

On the top you'll find a simple push button toggle to turn it on and off.

We want to be able to control the light with the computer, however, so I interrupted the power supply by cutting the circuit to the battery and soldering in a pair of wires (the blue ones). I put a bit of heat-shrink tubing over the joints to make them more resilient. I also knotted the wires to provide strain relief where they will emerge from the case.

When the case is reassembled, the wires can be fed out of the top of the hole where the recharging plug goes in.

After you slide the rubber skin back on, you have an LED lamp that can be controlled by your computer. If you want to wire it up directly, you might use your parallel port, as Eric Wilhelm does for the haunted house controller in Make volume 3. Instead, I incorporated it into my standard history appliance rig, which uses Phidgets controlled by Max/MSP.

For a quick demo project, I created a browser that lets me look through historic newspaper articles about séances from the online Globe and Mail archive. While browsing the stories from a particular time period, Spöka flashes gently in the background, faster if there are a lot of them, slower if not. It provides a nice peripheral feel for the intensity of Spiritualist activity at that point in time.
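
The mapping from article count to flash rate can be sketched in a few lines of Python. This is not the Max/MSP patch described above, only an illustration of the idea; the function name, the linear mapping and the default timings (2 seconds at the slowest, 0.2 seconds at the fastest) are all assumptions.

```python
def flash_interval(article_count, max_count, slowest=2.0, fastest=0.2):
    """Return the number of seconds between flashes.

    More articles in the current time period means a shorter interval
    (faster flashing). max_count is the largest count expected in any
    period; counts above it are clamped.
    """
    if max_count <= 0:
        return slowest
    # Fraction of the busiest period, clamped to the range 0..1.
    fraction = min(max(article_count / max_count, 0.0), 1.0)
    # Linear interpolation from the slowest to the fastest interval.
    return slowest - fraction * (slowest - fastest)
```

Whatever drives the lamp would then toggle it once per interval. A logarithmic mapping might feel more natural if a few periods have far more articles than the rest.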


Monday, July 02, 2007

Search Refinement with Compression

A few days ago I described a way of using Cilibrasi and Vitányi's Normalized Compression Distance (NCD) to automatically cluster bibliographic entries from the online Dictionary of Canadian Biography. A compression algorithm keeps track of redundancies when it is compressing a string. If those redundancies also occur in another string, then the two strings have something in common (i.e., the redundancies). The NCD ranges from about 0 (if the two strings are identical) to about 1 (if there is absolutely no overlap); with a real compressor the values rarely land exactly on either bound. Details are in the original article and laid out in one of my earlier posts.
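
As a reminder, the NCD of two strings x and y is (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s and xy is the two strings joined end to end. Here is a minimal Python sketch. The choice of zlib as the compressor and the function names are assumptions made for illustration; any reasonable compressor could be swapped in.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Comparing a string with itself gives a small positive number rather than exactly 0, because the compressor still needs a few bytes to say "the same thing again."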

Compression can also be used to automatically refine searches. Suppose you are interested in the explorer Martin Frobisher. If you type "Frobisher" into Yahoo! some of the first few pages of hits are relevant and some are not. Usually you have to wade through the results (or specify more search keywords and hope you don't eliminate something interesting by being too specific).

An alternate strategy is to enter a broad search keyword (e.g., "Frobisher") and use the NCD to automatically compare the summary that Yahoo! returns for each hit with a "probe" text such as Frobisher's DCB entry. A short Python program to do exactly that is listed here. The search engine results can then be ranked according to increasing NCD from the probe text.
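
The ranking step itself might look something like the following. This is a sketch under stated assumptions, not the program linked above: it uses zlib as the compressor, it does not call any search engine, and the hits are assumed to arrive as hypothetical (title, summary) pairs that you have already fetched by some other means.

```python
import zlib


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two text strings."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx = len(zlib.compress(bx, 9))
    cy = len(zlib.compress(by, 9))
    cxy = len(zlib.compress(bx + by, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_hits(probe, hits):
    """Sort (title, summary) pairs by increasing NCD from the probe text,
    so the summaries most like the probe come first."""
    return sorted(hits, key=lambda hit: ncd(probe, hit[1]))


# Made-up example data, for illustration only.
probe = ("Martin Frobisher was an English explorer who made three voyages "
         "to the Arctic in search of the Northwest Passage, landing on "
         "Baffin Island in 1576.")
hits = [
    ("Frobisher Capital",
     "Frobisher Capital offers wealth management, pension planning and "
     "mortgage advice for private clients."),
    ("Martin Frobisher",
     "Martin Frobisher, English explorer, made three voyages to the "
     "Arctic in search of the Northwest Passage."),
]

for title, summary in rank_hits(probe, hits):
    print(f"{ncd(probe, summary):.3f}  {title}")
```

Search summaries are short, so the distances are noisy; a longer probe text gives the compressor more redundancy to work with.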

The figure below shows the first 31 of 50 hits for "Frobisher" before and after this search refinement process. I used red font to indicate the irrelevant results. As can be seen, this use of compression and a probe text does a good job of floating the relevant hits to the top of the pile.
