In a series of thought-provoking posts, Steven Shaviro has been putting up "fragments, excerpts, and outtakes" from a book that he is writing called Age of Aesthetics. In the most recent, "Retconning History," Shaviro takes on the implications of history as "a vast database of 'information'." He argues that we now tend to understand history as an algorithmic search through a constraint space, much like the computer game Civilization. Furthermore, this is not due to information technology as much as to commodity culture. "The market mechanism defines our possibilities in the present, and colonizes our hopes and dreams for the future," he writes, "so it’s scarcely surprising that it remakes the past in its own image as well."
Shaviro goes on to argue that the past, the object of history, is retroactively changed in the present, a process that he likens to the retcons that are familiar from some forms of literature. As examples, he gives Ursula K. Le Guin's novel The Lathe of Heaven, where the protagonist's dreams change the past for other characters in the story (although he remembers the earlier reality), and season 5 of the TV show Buffy the Vampire Slayer, where a new character is suddenly introduced as if she had been there all along. "But is this not the way that History actually works?" he asks. "It’s a commonplace that history is written by the victors. The past is never secure from the future." Shaviro's wider point is that "capitalism is, among other things, a vast machine for imposing its own retrospective continuity upon everything it encounters." We imagine that it is possible to write history in all-encompassing terms, and project rational, economic man back to the origin of the species.
It's a pretty bleak picture. The idea of retconning, however, suggests a way out of the matrix. Like irony, retconning only works if the reader/interpreter is able to maintain a kind of double vision. Something is ironic if we are simultaneously aware of its expected reading and our deflection from that reading. Something has been retconned if we are simultaneously aware of the past that previously was, and the past we are now presented with. Readers of The Lathe of Heaven, or viewers of Buffy, understand that the past of the story has been changed precisely because they remember how it used to be. If they didn't remember, it would be as if the change had never occurred.
Historians, too, are constantly revising their interpretation of the past. (Here I would disagree with Shaviro by saying that it is not the past that is retroactively changed, but history. The past is gone, inaccessible, whereas history is what we make of it.) That constant revision is understood to be revisionist precisely because we remember what earlier interpretations were. Unlike in Civilization, each new interpretation becomes part of the archive and changes the rules of the game.
Tags: computer games | digital history | feature space | historiography | simulation
Wednesday, June 28, 2006
Monday, June 26, 2006
Setting up a Server
Up until now, all of the hacks that I've posted here have been client-side and somewhat labor-intensive. If you want to try it yourself (and that, after all, is the point of hacking) then you have to set up Perl on your own computer, download the appropriate files and start modifying them.
For some time, I've been planning to buy and set up a server, and the recent Doing Digital History workshop at the Center for History and New Media at GMU gave me the impetus to do so. So I ordered a server, and, in the meantime, have been doing test installations of server software on my home machine, so I have some idea of what to expect. The new server will be a not-very-expensive Windows XP Pro machine with 2 GB of RAM and two 320 GB hard drives. I will be using open source software whenever possible. After three days of installing, testing, making mistakes, uninstalling, and trying again, I've come up with the following sequence.
1. Install XAMPP. In order to run a minimal server, you're going to need a web server (Apache), a database server (MySQL), an FTP client and server (FileZilla), a way to make dynamically generated web pages (PHP) and an all-purpose programming language (Perl). In my first few attempts, I tried installing these packages separately and then getting them to work together. This turned out to be less straightforward than I had hoped, and I was in the process of trying to debug my installation when I discovered that it is possible to install everything you need in one pass. This is not only much faster, but everything works together as advertised. Once the basic XAMPP package is installed, you will want to install the Perl and mod_perl add-ons, and fix potential security holes. Before the following step, you will also need to enable CURL for PHP (a sketch of the php.ini change appears after this list).
2. Install WordPress. At the workshop, I was very impressed with the way that Josh Greenberg was able to quickly set up a workshop website, blog, syllabus, links to readings, individual blogs for class members, group blog feed and wiki. The magic behind this was WordPress, a popular open source package. If CURL is enabled for PHP (see step 1) it is very easy to import your existing blogs from Blogger and other popular sites. There are also a number of different themes available to change the look and feel of your site or of various parts of it. After playing around with Alex King's Theme Browser, I downloaded about a dozen themes for further experimentation.
3. Install Streetprint Engine and/or Greenstone Digital Library Software. I want our public and digital history students to be able to create online repositories with a minimum of effort. These two packages both worked very well in my test installations; the former is easier to use and more oriented toward page-at-a-time display of images or artifacts, while the latter is oriented more toward libraries. I will probably install both and use whichever is more appropriate to a particular project.
4. (Optional) Install content management system. I experimented with Joomla! and Drupal and decided not to install a CMS, at least not yet. Both systems allow users to create database-driven websites without programming, but I figure that WordPress will give me most of the functionality that I need. Besides, I want my students to program!
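For reference, enabling CURL (step 1) usually comes down to uncommenting a single line in php.ini and restarting Apache. Something like the following, although the file location and extension name vary with the XAMPP and PHP versions, so treat this as a sketch rather than gospel:

```ini
; php.ini -- the copy that Apache actually reads (phpinfo() will tell you
; which one that is; in XAMPP it is often xampp\apache\bin\php.ini).
; Find the commented-out curl line in the Windows Extensions section:
;
;   ;extension=php_curl.dll
;
; and remove the leading semicolon so the extension is loaded:

extension=php_curl.dll

; Restart Apache from the XAMPP Control Panel, then check phpinfo()
; again for a "curl" section to confirm that it worked.
```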
More once the server is up and running...
Tags: blogs | digital history | hacking | open source | perl | relational databases | server
Wednesday, June 21, 2006
A Roundup of Digital History Blogs
There aren't quite enough digital history blogs to justify a carnival yet, but there is certainly enough interesting work going on to try to gather a collection of links that are about digital history and of use to digital historians. If I've missed you or someone you know, please e-mail me at wturkel@uwo.ca and I will include a link in the next roundup. Without further ado...
Jeff Barry, "Endless Hybrids." Jeff blogs about libraries and the remix culture.
John Battelle, "John Battelle's Searchblog." Everything Google, and more.
Jeremy Boggs, ClioWeb. Jeremy is a web designer at the Center for History and New Media, and a PhD candidate at GMU. He blogs about the intersection of history, new media and design. You can also catch him at the new group blog Revise and Dissent.
Tara Calishain, "ResearchBuzz." A great place to find up-to-the-minute information about internet research.
Dan Cohen. Co-author of Digital History (U Penn, 2005), Dan writes an eponymous blog about digital humanities, Google and searching, and "programming for poets." He is also the author of GMU's clever H-Bot.
Tom Daccord, "THWT Edublogger." Tom blogs about teaching humanities with technology.
Lorcan Dempsey, "Lorcan Dempsey's Weblog." Lorcan is the vice-president and chief strategist at OCLC. He blogs about libraries, services and networks.
Brian Downey, "Behind AotW." Brian is the webmaster of the site Antietam on the Web. Here in the so-called "backwash of a digital history project," he blogs on the nature of digital history.
Josh Greenberg, Epistemographer. Josh, Associate Director of Research Projects at CHNM, brings an STS perspective to digital humanities and Web 2.0. He hasn't posted for an embarrassing 62 days, but I'm hoping to shame him into it...
Mills Kelly, edwired. Mills blogs about teaching and learning history online. In recent posts he's taken on PowerPoint, edited Wikipedia articles (and made his students do the same), and critiqued collaborative learning and best practices in education.
Rob MacDougall, Old is the New New. Strictly speaking, Rob doesn't usually blog about digital history per se. But he loves robots, and he's been writing quality posts for years longer than most of us.
David Mattison, The Ten Thousand Year Blog. David, a BC archivist, blogs about digital preservation and libraries. Among many other categories, his "history findings" tracks an interesting and growing collection of online repositories.
Musematic. A group blog about museum informatics.
ProgrammableWeb. Because the world really is your programmable oyster.
Geoffrey Rockwell, "grockwel." Geoffrey is an Associate Professor of Multimedia at McMaster and project leader for the TAPoR project. He blogs about digital humanities in general.
Tom Scheinfeldt, Found History. Tom writes about "unintentional, unconventional, and amateur history all around us." It makes for fascinating reading, and since "all around us" includes the internet, this is often digital history at the margins.
Andrew Vande Moere, "Information Aesthetics." Andrew blogs about data visualization and visual culture. This is a great place to get ideas for presenting history in a new way.
Rebecca Woods, Past Matters. This fall, Rebecca will be starting her PhD at MIT. In the meantime, she's got a brand new blog where she can discuss learning to program and trying to text mine the archives of the Old Bailey.
Michael Yunkin, Digitize Everything. Michael is a librarian at UNLV, who is "helping dig the grave of all things analog." He blogs about digitization, search and oral history.
Updates, June 24, 2006 - July 10, 2006
Manan Ahmed, Chapati Mystery. Manan, who is working on a PhD at U Chicago, blogs about a variety of topics. His recent Polyglot Manifesto is essential reading (I, II).
Sheila Brennan, Relaxing on the Bayou. Sheila blogs about both physical and online museums.
Google has a new group blog to watch, Inside Google Book Search.
Tom Goskar, Past Thinking. Tom blogs about archaeological computing from the UK.
Paula Petrik, HistoryTalk. Among other things, Paula blogs about history, markup and web/tech. She's got a great new discussion of the problems and possibilities of XHTML and CSS for online scholarship.
Rachel "I'm Too Sexy for My Master's Thesis," blogs about new publishing models and the use of technology to aid research and writing.
Richard Urban, Inherent Vice. Richard is a PhD student in library and information science who blogs about digital cultural heritage.
Scott Weingart, a research assistant on a digital humanities project at the University of Florida, has just started Ludus Historiae, to bring "the absurdities of the past to the computers of the future."
Tags: blogs | digital history
Jeff Barry, "Endless Hybrids." Jeff blogs about libraries and the remix culture.
John Battelle, "John Battelle's Searchblog." Everything Google, and more.
Jeremy Boggs, ClioWeb. Jeremy is a web designer at the Center for History and New Media, and a PhD candidate at GMU. He blogs about the intersection of history, new media and design. You can also catch him at the new group blog Revise and Dissent.
Tara Calishain, "ResearchBuzz." A great place to find up-to-the-minute information about internet research.
Dan Cohen. Co-author of Digital History (U Penn, 2005), Dan writes an eponymous blog about digital humanities, Google and searching, and "programming for poets." He is also the author of GMU's clever H-Bot.
Tom Daccord, "THWT Edublogger." Tom blogs about teaching humanities with technology.
Lorcan Dempsey, "Lorcan Dempsey's Weblog." Lorcan is the vice-president and chief strategist at OCLC. He blogs about libraries, services and networks.
Brian Downey, "Behind AotW." Brian is the webmaster of the site Antietam on the Web. Here in the so-called "backwash of a digital history project," he blogs on the nature of digital history.
Josh Greenberg, Epistemographer. Josh, Associate Director of Research Projects at CHNM, brings an STS perspective to digital humanities and Web 2.0. He hasn't posted for an embarassing 62 days, but I'm hoping to shame him into it...
Mills Kelly, edwired. Mills blogs about teaching and learning history online. In recent posts he's taken on PowerPoint, edited Wikipedia articles (and made his students do the same), and critiqued collaborative learning and best practices in education.
Rob Macdougall, Old is the New New. Strictly speaking, Rob doesn't usually blog about digital history per se. But he loves robots, and he's been writing quality posts for years longer than most of us.
David Mattison, The Ten Thousand Year Blog. David, a BC archivist, blogs about digital preservation and libraries. Among many other categories, his "history findings" tracks an interesting and growing collection of online repositories.
Musematic. A group blog about museum informatics.
ProgrammableWeb. Because the world really is your programmable oyster.
Geoffrey Rockwell, "grockwel." Geoffrey is an Associate Professor of Multimedia at McMaster and project leader for the TAPoR project. He blogs about digital humanities in general.
Tom Scheinfeldt, Found History. Tom writes about "unintentional, unconventional, and amateur history all around us." It makes for fascinating reading, and since "all around us" includes the internet, this is often digital history at the margins.
Andrew Vande Moere, "Information Aesthetics." Andrew blogs about data visualization and visual culture. This is a great place to get ideas for presenting history in a new way.
Rebecca Woods, Past Matters. This fall, Rebecca will be starting her PhD at MIT. In the meantime, she's got a brand new blog where she can discuss learning to program and trying to text mine the archives of the Old Bailey.
Michael Yunkin, Digitize Everything. Michael is a librarian at UNLV, who is "helping dig the grave of all things analog." He blogs about digitization, search and oral history.
Updates, June 24, 2006 - July 10, 2006
Manan Ahmed, Chapati Mystery. Manan, who is working on a PhD at U Chicago, blogs about a variety of topics. His recent Polyglot Manifesto is essential reading (I, II).
Sheila Brennan, Relaxing on the Bayou. Sheila blogs about both physical and online museums.
Google has a new group blog to watch, Inside Google Book Search.
Tom Goskar, Past Thinking. Tom blogs about archaeological computing from the UK.
Paula Petrik, HistoryTalk. Among other things, Paula blogs about history, markup and web/tech. She's got a great new discussion of the problems and possibilities of XHTML and CSS for online scholarship.
Rachel "I'm Too Sexy for My Master's Thesis," blogs about new publishing models and the use of technology to aid research and writing.
Richard Urban, Inherent Vice. Richard is a PhD student in library and information science who blogs about digital cultural heritage.
Scott Weingart, a research assistant on a digital humanities project at the University of Florida has just started Ludus Historiae, to bring "the absurdities of the past to the computers of the future."
Tags: blogs | digital history
Sunday, June 18, 2006
Refracted Footnotes
In my last post, I suggested that broken links, although typically perceived to be a problem, might also serve as an interesting historical source if properly mined. Today I'd like to address a related topic: the fact that the content of websites changes even while the URL remains the same. In this blog, for example, I've made a lot of references to articles in the constantly changing, user-contributed Wikipedia. This means that the article on concordances that I cited on 29 Jan 2006 has changed since I cited it ... 16 times, in fact.
The reason that I know how many times the article has changed is that Wikipedia has a 'history' page for each article that keeps track of revisions and allows the user to compare selected versions of the article. The most recent change to the article on concordances (as of this post) was to fix a misspelling of the word 'frequently'.
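For the curious, here is roughly how you could pull that revision count programmatically instead of eyeballing the history page. This is an untested sketch against the MediaWiki api.php interface; the article title, the revision limit, and the parameters are all assumptions to adjust as needed:

```perl
#!/usr/bin/perl
# Sketch: count recent revisions of a Wikipedia article via the MediaWiki
# api.php interface. Untested; the title, limit, and parameters are
# assumptions, and only the most recent batch of revisions is fetched.

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape);

my $title = 'Concordance (publishing)';    # assumed article title
my $url   = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
          . '&rvprop=timestamp|user|comment&rvlimit=50&format=xml'
          . '&titles=' . uri_escape($title);

my $ua       = LWP::UserAgent->new(agent => 'revision-counter/0.1');
my $response = $ua->get($url);
die "Request failed: ", $response->status_line, "\n"
    unless $response->is_success;

# Crude but serviceable: each revision comes back as a <rev ...> element,
# so pull out the timestamp attribute of each one.
my @revisions = ($response->content =~ /<rev\b[^>]*timestamp="([^"]+)"/g);

printf "%s: %d revisions returned (most recent %s)\n",
    $title, scalar(@revisions),
    @revisions ? $revisions[0] : 'n/a';
```

Run against an article you cite, the same request with a larger limit gives you the raw material for seeing how quickly the page churns between your visits.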
A sophisticated Wikipedia user consults the history page before reading the article, much the way that E. H. Carr recommended that we study the historian before reading his or her work. It soon becomes clear that some articles offer a relatively stable interpretation, while others are subject to ceaseless revision. Although similar processes can be observed in more traditional historical discourse, the rapidity with which Wikipedia changes, and its extensive and automatic philological apparatus, make it a natural laboratory for experiments in digital history. Mills Kelly, for example, has blogged about his experiences trying to change the article on the fate of the Donner Party, widely believed to have resorted to cannibalism when snowbound in the Sierra Nevada in 1846. Mills revised Wikipedia to reflect new work in historical archaeology which calls this interpretation into question, and then watched as other people revised his changes toward a temporary consensus. Mills has also assigned his grad students the task of writing Wikipedia entries and watching them be revised. [For more on his experiments, see What's for Dinner? (Cont'd 1) (Cont'd 2) and Whither Wiki?]
As with other forms of new media, we need to be teaching young historians how to read wikis critically and how to write them effectively. We should also be aware of the new resources that they provide for computational historiography. To take a single example, look at the "history flow visualization" created by Martin Wattenberg and Fernanda Viégas, which maps the revisions of the Wikipedia article on evolution.
So is all of this completely new? Not really. As any work is reinterpreted over time, citations to that work will change in meaning, too. Traditional historical works age along with the literature that they cite. As a physical analog, we might think of refraction: a wave bending when it enters a new medium. There's a good description of the phenomenon at Wikipedia ... at least there is right now.
Tags: bibliography | citation | data mining | digital history | historiography | public history | Wikipedia
Tuesday, June 13, 2006
Broken Links
Day two of CHNM's Doing Digital History workshop kicked off with a topic close to my heart: data and text mining. One of the things that we discussed was link analysis (also known as graph mining or relational data analysis), the ability to exploit connections between entities as a way of making inferences or refining searches. Google makes a different, but related, use of links in its PageRank algorithm.
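For readers who haven't met it, the idea behind PageRank is simple enough to sketch in a few lines: each page's score is repeatedly redistributed to the pages it links to until the numbers settle down. Here is a toy version on a made-up four-page graph; the graph, the damping factor and the iteration count are all arbitrary:

```perl
#!/usr/bin/perl
# Toy illustration of link analysis: a few iterations of the PageRank
# power method on a tiny, made-up link graph.

use strict;
use warnings;

# Hypothetical graph: page => [pages it links to]
my %links = (
    'archive.html' => ['index.html', 'sources.html'],
    'index.html'   => ['sources.html'],
    'sources.html' => ['index.html'],
    'orphan.html'  => ['index.html'],
);

my @pages = sort keys %links;
my $d     = 0.85;                                 # damping factor
my %rank  = map { $_ => 1.0 / @pages } @pages;    # start out uniform

for my $iter (1 .. 20) {
    my %next = map { $_ => (1 - $d) / @pages } @pages;
    for my $page (@pages) {
        my @out = @{ $links{$page} };
        next unless @out;
        # Share this page's current rank equally among its outbound links.
        $next{$_} += $d * $rank{$page} / @out for @out;
    }
    %rank = %next;
}

printf "%-14s %.4f\n", $_, $rank{$_}
    for sort { $rank{$b} <=> $rank{$a} } @pages;
```

Unsurprisingly, index.html and sources.html end up on top: everything points at them.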
The discussion got me to thinking about broken links. We've all had the experience of clicking on a hypertext link and getting an HTTP 404 error, the message that the requested file cannot be found. It seems to be generally accepted that broken links are a bad thing. It is quite easy to write scripts that check each of the links on a web page and report on broken ones. If you don't want to write your own script, the W3C has an online link checker. In fact, a 2004 article in the BBC Technology News offered the hope that broken links might one day be eliminated altogether.
The article describes research done by student interns in conjunction with IBM. Their system follows working links to create a "fingerprint" of each page that is linked to. It can then determine when the content of the target page changes and notify system administrators or even change the link automatically. Such a system would reduce lost productivity and prevent large corporations from getting into the embarrassing position of linking to an innocuous site only to have it change into something disreputable.
So far, so good. But what if we consider broken links to be a kind of historical evidence? Existing link checkers can already spider through whole sites looking for broken links. Rather than fixing them (or before fixing them) why not compile an archive of them to study? We could ask why links get broken in the first place. Sure, some are bound to be typos. But in many cases, the target site has moved to a different address, and this continual process of renaming reflects other processes: of rebranding, search engine positioning, system or business process reorganization, and so on. We should take a page from the archaeologists' book, and pay more attention to what our middens have to tell us.
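Here is a rough, untested sketch of what such a midden-builder might look like: spider a single page, test each outbound link, and append the failures, with a timestamp and a fingerprint of whatever came back, to a log file for later study. The starting URL and output file are placeholders, and the CSV is naive:

```perl
#!/usr/bin/perl
# Sketch of a broken-link harvester. Untested; the starting URL and
# output file are placeholders. Requires LWP, HTML::LinkExtor and
# Digest::MD5 from CPAN.

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use Digest::MD5 qw(md5_hex);

my $start   = 'http://example.com/somepage.html';    # placeholder
my $logfile = 'broken_links.csv';

my $ua   = LWP::UserAgent->new(timeout => 15, agent => 'link-midden/0.1');
my $page = $ua->get($start);
die "Could not fetch $start: ", $page->status_line, "\n"
    unless $page->is_success;

# Pull the link attributes out of the page, absolutized against $start.
my $extor = HTML::LinkExtor->new(undef, $start);
$extor->parse($page->content);
my %seen;
my @urls = grep { /^http/i && !$seen{$_}++ }
           map  { my ($tag, %attr) = @$_; values %attr } $extor->links;

open my $log, '>>', $logfile or die "Cannot open $logfile: $!";
for my $url (@urls) {
    my $resp = $ua->get($url);
    next if $resp->is_success;
    # Record when and how the link failed, plus a fingerprint of the
    # error page we got back, in case that changes later too.
    print $log join(',', time(), $start, $url, $resp->code,
                    md5_hex($resp->content || '')), "\n";
}
close $log;
```

Run on a schedule against a list of pages, a log like this would slowly build up exactly the kind of stratigraphy worth studying.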
Tags: broken links | document fingerprinting | Doing Digital History workshop | link analysis
Monday, June 12, 2006
A Search Engine for 17th-Century Documents
One of the questions raised in the discussions today at the Doing Digital History workshop was how to make a search engine that returned documents written in a particular era. I'm assuming that we don't have access to metadata (that would be too easy). Here's the plan that I came up with. I haven't had a chance to try it yet, but it might make for some interesting hacking later on.
1. Take the pages returned by a standard search engine, filter out any HTML or other markup tags, and pass the plain text through a part-of-speech tagger to identify how each word is being used in context.
2. After tagging, remove all of the stop words, things like 'and', 'the', 'is', etc.
3. Check each of the remaining words against an online etymological dictionary (like the Oxford English Dictionary) to determine the earliest attested date for each, and note whether the word has since fallen into disuse.
4. You should end up with a vector of dates, the latest of which will put a bound on the earliest that the document could have been written.
Less-common words will tend to be better indicators of date than more common ones, so it might help to take overall word frequency into account in the algorithm. The earliest date that this blog post could have been written, for example, would be bounded by the earliest attested dates of 'metadata' (1969), 'search engine' (1984), 'HTML' (1993) and 'blog post' (1999).
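A bare-bones sketch of the date-bounding step might look like the following. The attestation dates are a tiny hand-made stand-in for a real lookup in the OED or another dictionary, the stop word list is minimal, and the part-of-speech tagging pass is skipped entirely:

```perl
#!/usr/bin/perl
# Sketch of the date-bounding idea: drop stop words, look up the earliest
# attested date for each remaining word, and report the latest of those
# dates as a bound on how early the text could have been written.

use strict;
use warnings;
use List::Util qw(max);

my %stopword = map { $_ => 1 } qw(a an and the is of to in for on that this about);

# Hand-made, purely illustrative attestation dates -- not real OED lookups.
my %earliest = (
    metadata => 1969,
    html     => 1993,
    blog     => 1999,
    post     => 1500,
    document => 1400,
    search   => 1400,
    engine   => 1300,
);

my $text = 'This blog post is a document about metadata and HTML.';

my @words = grep { !$stopword{$_} }
            map  { lc }
            ($text =~ /([a-zA-Z]+)/g);

# Keep the dates we know about; unknown words contribute nothing.
my @dates = grep { defined } map { $earliest{$_} } @words;

printf "Latest attestation among known words: %d -- the text cannot predate that year.\n",
    max(@dates);
```

Compound terms like 'search engine', and words whose relevant sense is much newer than the word itself, would need more care than this sketch provides.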
Tags: dictionaries | Doing Digital History workshop | etymology | search
An Experimental Interface
This whole week I'm at the Doing Digital History workshop sponsored by the Center for History and New Media at George Mason University. There's a great bunch of people, a lot of interesting sessions and good ideas flying around. In short, it's nerdvana for digital historians. One of the workshop activities is regular blogging so I will try to post something here at the end of each day.
The first activity today was to self-organize into small groups and study a number of different websites, talking them over together while surfing through them. In itself, that was an interesting activity; as Josh Greenberg noted, we rarely surf as a communal or conversational practice. My favorite site was HistoryWired at the Smithsonian, which has an admittedly experimental interface that uses something like a heatmap (aka treemap) to cross-classify 450 interesting objects from the 3 million the institution has. It takes a while to get used to the HistoryWired interface, and a number of the workshop participants found it to be too visually busy and resource-intensive for their taste. What I liked about it was that the more that I played with it, the more features I discovered. It is possible to zoom into a particular region of the collection, to explore classes of artifacts with a timeline, and to get an immediate visual sense of the overlap of particular categories. Given the roots of this kind of representation in data mining and visualization, I imagine that the interface would scale up quite nicely if it were used as the front-end for a very large collection of sources. Judging from the criticism of the workshop group, I suspect that this is not a successful way to present history to the general public, but it could be a very useful exploratory tool for some kinds of research.
Tags: data mining | Doing Digital History workshop | visualization
Sunday, June 04, 2006
Experimenting with the TAPoR Tools
This summer I'm in the process of developing a new graduate course on digital history. One of the things that we will study is the creation of online historical materials, and for this, I plan to assign Cohen and Rosenzweig's Digital History. I would also like to emphasize the new computational techniques that historians will increasingly need to use with digital sources. This raises some interesting challenges. I can't assume that my students will know how to program or that they will be familiar with markup languages like HTML or XML. We don't even really have time for the systematic exploration of a particular language, like Perl. (Although we will have time for some fun stuff.) I've decided to focus on specific problems faced by historians working in the digital realm, and show how computation makes them tractable. I'll say more about the course in future posts; for now, suffice it to say that it will teach stepwise refinement, be very hands-on and, no doubt, a bit hackish.
The beta release of the TAPoR Text Analysis Portal gives students the chance to experiment with text processing without having to code everything from scratch. It allows the user to enter the URL of a digital source and then explore the text with an interactive concordance.
For example, suppose you want to get (or convey) a sense of how the historian's job of interpretation can be augmented with computational tools. Go to the online Dictionary of Canadian Biography and choose an entry at random. I picked Robert McLaughlin, someone with whom I wasn't already familiar. Using the TAPoR tool it is possible to find the most frequently occurring distinctive words and phrases in McLaughlin's biography:
mclaughlin carriage
in oshawa
company
toronto
motor
automobiles
business
It is also possible to get information about keywords in context. For example, clicking on "carriage" returns the following:
Enniskillen, where he built a | carriage | works, which, in at least
him to build the Oshawa | Carriage | Works, a three-storey brick
which became known as McLaughlin | Carriage | about , was facilitated by careful
new designs (some influenced by | Carriage | Monthly, a Philadelphia journal), and
patents (and buying others), refining | carriage | mechanisms, tabulating the credit ratings
mostly wholesale business of McLaughlin | Carriage | is all the more impressive
transportation. Boosted as the largest | carriage | maker in the British empire
Without reading the biography yet, I can now guess that Robert McLaughlin lived in Oshawa and founded a carriage works which became very successful. At this point, it is reasonable to object that I could have learned the same thing by reading his biography. The point, however, is that a computer can't learn by reading, but it can make use of text processing to produce more useful output. For example, suppose you wanted to create a "smarter" search engine. If you type "Robert McLaughlin" into Google, you get the following results.
- An art gallery in Oshawa
- (ditto)
- Bible Ministries
- (ditto)
- A photographer in Glasgow
- An art gallery in Oshawa
- A book about the battle of Okinawa in WWII
- A role-playing game called "Cthulhu Live"
- Realtors in New Jersey
- The blog of a Californian graphic artist
Now these results have less to do with one another than the animals in Borges' "Chinese Encyclopedia". But what if your search engine were to recognize "Robert McLaughlin" as a proper name, first submit the search to the Dictionary of Canadian Biography, process the text for keywords, and then submit the query "Robert McLaughlin"+oshawa+carriage to Google? (There's a rough sketch of this query-expansion step after the list of results below.) Then the first ten results would look like this:
- A Wikipedia entry on Oshawa with information about the McLaughlin Carriage Company
- The Answers.com entry on Oshawa with information about the McLaughlin Carriage Company
- A popular history website (Mysteries of Canada) with an article about the McLaughlin Carriage Company and General Motors
- The history page of the City of Oshawa website with information about McLaughlin and his carriage company
- An art gallery in Oshawa
- The Canadian Encyclopedia entry on Oshawa with information about the McLaughlin Carriage Company
- The Oshawa Community Museums and Archives page about the McLaughlin Carriage Company
- An art gallery in Oshawa
- An article about McLaughlin from the Financial Post, reproduced by the Business Library at the University of Western Ontario
- A history page on the GM Canada website which talks about McLaughlin and his company
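Here is a rough, untested sketch of that query-expansion step: fetch a biography, count its most frequent content words, and bolt the top couple onto the original name. The biography URL is a placeholder (actually finding the right Dictionary of Canadian Biography entry for an arbitrary name is the part I'm waving my hands over), and the stop word list is minimal:

```perl
#!/usr/bin/perl
# Sketch of query expansion from a biography. Untested; the URL below is
# a placeholder, and the tag stripping and stop word list are crude.

use strict;
use warnings;
use LWP::UserAgent;

my $name = 'Robert McLaughlin';
my $url  = 'http://www.biographi.ca/some-entry.html';    # placeholder

my $ua   = LWP::UserAgent->new(agent => 'query-expander/0.1');
my $resp = $ua->get($url);
die "Could not fetch $url: ", $resp->status_line, "\n" unless $resp->is_success;

# Crude tag stripping and tokenizing -- fine for a sketch.
(my $text = $resp->content) =~ s/<[^>]*>/ /g;

my %stopword = map { $_ => 1 }
    qw(the a an and or of in to was his he by for with on at as it that);

my %count;
$count{$_}++ for grep { length($_) > 2 && !$stopword{$_} }
                 map  { lc } ($text =~ /([a-zA-Z]+)/g);

# Skip the words of the name itself; they are already in the query.
delete $count{lc $_} for split ' ', $name;

# Take the two most frequent remaining words as expansion terms.
my @keywords = grep { defined }
               (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 1];

print qq{Expanded query: "$name" }, join(' ', @keywords), "\n";
# For the McLaughlin entry this ought to come out something like:
#   Expanded query: "Robert McLaughlin" carriage oshawa
```

The expanded string could then be handed to Google (or any other engine) in place of the bare name.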
Tags: concordance | dictionary of canadian biography | digital history | history education | pedagogy | search | stepwise refinement | text mining