Sunday, May 04, 2008

The Programming Historian is Now Available

The Programming Historian is now available on the NiCHE: Network in Canadian History & Environment website. This work is an open-access introduction to programming in Python, aimed at working historians (and other humanists) with little previous experience. Introductory lessons teach you how to

  • install Zotero, the Python programming language and other useful tools
  • read and write data files
  • save web pages and automatically extract information from them
  • count word frequencies
  • remove stop words
  • automatically refine searches
  • make n-gram dictionaries
  • create keyword-in-context (KWIC) displays
  • make tag clouds, and
  • harvest sets of hyperlinks
The Programming Historian is a work-in-progress. We are constantly adding new material, much of it driven by reader request. Upcoming topics will include indexing, scraping projects, simple spiders, mashups and much more.


Thursday, April 10, 2008

Fitness Functions

[Cross-posted to Cliopatria & Digital History Hacks]

One of the distinctions that applied mathematicians make is between linear and nonlinear problems. In a linear problem, you have a set of variables that you can tweak, and as you adjust each variable you can get ever closer to an optimal configuration. Using techniques such as linear programming, it is straightforward to determine precisely how many scoops of raisins to put in your box of bran, or how many Cherries will make a Garcia. Many problems, alas, don't admit of this kind of solution. In the days before digital everything, it was all too common to futz around with the brightness knob, color balance, rabbit ears, and position of pets and small children to try and get a TV signal that didn't look like it was being relayed from the dark side of the moon. The slightest change could make things drastically better or worse, with no apparent logic.

The problem with nonlinear problems is that you pretty much have to get every variable right at the same time. Think of the space of all possible states of your problem as a kind of dark landscape, and the optimal solution as the highest point in that space. Linear problems have smooth landscapes. If you start groping your way up a hill, you end up at the top and that's the best you can do overall. Nonlinear problems have jagged landscapes. It is easy to feel your way up a low peak and get stuck there, unaware of higher peaks elsewhere.
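The landscape metaphor can be made concrete with a small sketch. The Python below is only an illustration: the `hill_climb` function and the two lists of heights are invented for this example. The climber walks uphill one step at a time and stops when neither neighbour is higher.

```python
def hill_climb(heights, start):
    """Walk uphill one step at a time until no neighbour is higher."""
    position = start
    while True:
        # Consider only the neighbours that actually exist.
        neighbours = [p for p in (position - 1, position + 1)
                      if 0 <= p < len(heights)]
        best = max(neighbours, key=lambda p: heights[p])
        if heights[best] <= heights[position]:
            return position  # no higher neighbour: we are on a peak
        position = best


# Made-up landscapes: one smooth with a single peak, one jagged.
smooth = [1, 2, 3, 4, 5, 4, 3, 2, 1]
jagged = [1, 3, 2, 1, 4, 9, 4, 1, 2]

smooth_peak = hill_climb(smooth, 0)
jagged_peak = hill_climb(jagged, 0)
print(smooth_peak, jagged_peak)
```

Starting from the left edge, the climber reaches the single summit of the smooth landscape (position 4) but stops on the low peak at position 1 of the jagged one, never finding the much higher peak at position 5.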

There are different methods for solving nonlinear optimization problems; one of the more popular makes use of genetic algorithms. First you find a way of representing all of the possible solutions to your problem. In the TV example, you might want to represent the angle of each of the two antennas, the xy coordinates of the napping cat, the rotational angle of the brightness knob, and so on. A list of each of these variables is known as a genome, and a list of particular values as a genotype. Generate a small random population of genotypes, and test each one to see how good it is. This test is called the fitness function. In our example, it is the person sitting on the couch shouting "not bad," "pretty good" or "awful" each time an adjustment is made. Once you know how well each of your solutions performed, you make a new generation of solutions by mutating and recombining the genomes of your old ones. Over time, the fitness of the population increases, and the artificial selection mechanism eventually finds solutions that are near optimal. (If you want to start programming your own GAs, I recommend Mitchell's Introduction and Goldberg's Genetic Algorithms as good places to start.)
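The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation. The variable names and ranges in `GENOME` are invented, and since there is no viewer on the couch to consult, the `fitness` function simply measures distance from a made-up list of ideal settings (`TARGET`). Keeping the fitter half of each generation as parents is one of many possible selection schemes.

```python
import random

# Hypothetical genome: each variable with its allowed range.
GENOME = [
    ("left_antenna_angle", 0.0, 180.0),
    ("right_antenna_angle", 0.0, 180.0),
    ("cat_x", 0.0, 5.0),
    ("cat_y", 0.0, 5.0),
    ("brightness_knob", 0.0, 270.0),
]

# Made-up ideal settings, standing in for the viewer on the couch.
TARGET = [45.0, 135.0, 2.5, 1.0, 90.0]


def random_genotype(rng):
    """One particular set of values for the genome."""
    return [rng.uniform(low, high) for _, low, high in GENOME]


def fitness(genotype):
    """Stand-in fitness function: higher is better, 0 is a perfect picture."""
    return -sum(((value - ideal) / (high - low)) ** 2
                for value, ideal, (_, low, high)
                in zip(genotype, TARGET, GENOME))


def recombine(mother, father, rng):
    """Uniform crossover: take each value from one parent or the other."""
    return [rng.choice(pair) for pair in zip(mother, father)]


def mutate(genotype, rng, rate=0.2):
    """Nudge some values at random, staying inside each variable's range."""
    child = []
    for value, (_, low, high) in zip(genotype, GENOME):
        if rng.random() < rate:
            value += rng.gauss(0, (high - low) * 0.05)
        child.append(min(high, max(low, value)))
    return child


def evolve(generations=50, size=20, seed=1):
    rng = random.Random(seed)
    population = [random_genotype(rng) for _ in range(size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: size // 2]  # keep the fitter half
        children = [
            mutate(recombine(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best, history = evolve()
print(history[0], history[-1])
```

Because the fitter half always survives into the next generation, the best fitness in `history` never goes down, although nothing guarantees that the population finds the true optimum.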

One of the perennial tragedies of academia is that we constantly pretend that our careers or those of our students are linear optimization problems. Grades are the most obvious way that we do this. Students learn that their mark on one test is independent of their mark on another, that it is better to have a high GPA than to risk taking hard courses that interest them, that exploration and failure will usually be punished. Teachers justify marks by appealing to rubrics, bemoaning grade inflation and students "who look good on paper." Too many of us think of a good career in terms of lines on a CV, a list of so many independent accomplishments, each of which can be attained and then forgotten.

On a rainy day in 1992, I wandered into a Vancouver technical bookstore on my way home from school. I think I was probably avoiding a problem set or some other homework, as I've never been very good at doing what I should be doing rather than what I want to be doing. Anyway, I remember finding a copy of John Holland's Adaptation in Natural and Artificial Systems on the shelf of new releases and really wanting to buy it. I stood in the store holding the book for the longest time. It was more than I could afford; it was a distraction from my schoolwork; I had a bad habit of buying books and losing interest in them. I had been doing a lot of exploring and a fair bit of failing. I finally made the decision that was, in context at least, sub-optimal. I bought the book and went home to read it rather than doing my schoolwork.

I often tell my students that they should follow their curiosity, take chances and not be afraid to fail. You never really know what whim, what chance encounter or distraction is going to change your life. In my case, I read a lot of science fiction and graphic novels and ate a lot of guacamole. I played role playing games and got married early and happily. I watched TV. I got bad grades in linear algebra and analysis, but I liked math enough to keep trying until I got better at it. And my first published work was on a subject that was novel and trendy enough that my reputation as an up-and-coming researcher outweighed my uneven transcript: genetic algorithms. It's tempting to look back at that moment in the bookstore as a crucial inflection point in my life, but that would be too linear. The choices that we make affect our fitness, but never in a way that makes it easy to assign credit or blame.


Saturday, April 05, 2008

Visualizing the Emergence of a Strategic Knowledge Cluster

In the summer of 2004, when I had just arrived at the University of Western Ontario, my new colleague Alan MacEachern invited me to join a small group that was putting together a grant application. The federal agency SSHRC had just announced funding for the design of something called 'research clusters'. At the time none of us was particularly clear what these clusters were supposed to be and, as with many of the best kinds of opportunity, I don't think that SSHRC was really clear either. We eventually settled on the idea that the main task of clusters was 'knowledge mobilization', which left the matter nicely open.

Our initial grant application was successful, and five of us set to work to develop NiCHE, the Network in Canadian History & Environment / Nouvelle initiative canadienne en histoire de l'environnement. As we tried various things we kept track of activities and participants, allowing us to visualize the emergence of our research network. I should say up front that NiCHE doesn't cause research and is prohibited from directly funding research per se. Instead we find ways to facilitate research and training in environmental history broadly construed, and to mobilize the knowledge that researchers create.

One of the tools that we use for visualization is an open source package called Graphviz. We create a file that specifies entities (people, publications, field trips, etc.) and the relationships between them, then we hand off that file to Graphviz, which uses sophisticated algorithms to figure out a neat way to plot the network. We've found such visualization to be very useful, even though it can only ever show the tip of a much larger social iceberg. In our graphs, two people may be linked because they attended the same meeting or each published a chapter in a book. Our data doesn't show whether they knew each other in grad school, have a longstanding rivalry, or both secretly like Buffy the Vampire Slayer.
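To give a sense of what such a file might contain, here is a minimal Python sketch that writes a graph description in Graphviz's DOT language. It is not the script NiCHE used. The people and activities are placeholders, and the `to_dot` helper is invented for this example.

```python
# Hypothetical records of who took part in what; all names are placeholders.
participation = [
    ("Person A", "Toronto meeting"),
    ("Person B", "Toronto meeting"),
    ("Person B", "Book chapter"),
    ("Person C", "Book chapter"),
]


def to_dot(pairs):
    """Build the text of an undirected graph in the DOT language."""
    lines = ["graph network {"]
    activities = sorted({activity for _, activity in pairs})
    for activity in activities:
        # Draw activities as boxes so they stand apart from people.
        lines.append(f'    "{activity}" [shape=box];')
    for person, activity in pairs:
        # Each "--" line is one undirected edge.
        lines.append(f'    "{person}" -- "{activity}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(participation)
print(dot_text)
```

Assuming Graphviz is installed, saving the output as `network.dot` and running something like `neato -Tpng network.dot -o network.png` should produce an image in which Person A and Person C are connected only through Person B.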

The original NiCHE executive group worked quite closely together. One of the interesting facts about networks is that the number of possible pairwise relations between entities grows much faster than the number of entities as the network gets larger. Two people have at most one relationship; three people can have three (AB, BC, AC); four people can have six (AB, AC, AD, BC, BD, CD). The ten possible pairwise relationships between the five of us looked like this:
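The counts in the paragraph above follow the formula n(n-1)/2. As a quick check, this short Python sketch enumerates the pairs for a five-person group; the initials A through E are placeholders rather than the actual members.

```python
from itertools import combinations


def pair_count(n):
    """Number of possible pairwise relations among n entities: n(n-1)/2."""
    return n * (n - 1) // 2


# Placeholder initials for a five-person group.
members = ["A", "B", "C", "D", "E"]
pairs = ["".join(pair) for pair in combinations(members, 2)]

print(pairs)
print(len(pairs), pair_count(5))
```

The same formula shows how quickly things grow: a network of 300 people has 44,850 possible pairs.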



One of the first things that we tried to do was provide licenses for Groove collaborative software to all of the people who were interested in joining NiCHE. For people with Windows machines the software worked very well. Unfortunately, it never really worked for people with Macs. We had to supplement Groove with other software, find suboptimal workarounds, and eventually abandon it. For a while, however, it gave us a way to interact relatively closely with NiCHE members who also happened to be tech-savvy Windows users. Our network took on a hub-and-spoke form.



To reach out to more potential participants, we formed an advisory group and held a meeting in Toronto. Instead of one hub, we now had two, with some bridging members who participated in both online and face-to-face activities.



The executive group split up to host regional meetings in other cities across Canada.



We put together an online directory so members could add information about themselves. The directory allowed us to contact people and tell them about upcoming activities. Since it was publicly accessible, the directory also allowed NiCHE members to learn more about one another.



Although adding one's name to a directory is a relatively weak form of participation, we found that many people became more active in NiCHE over time. The network seemed to extend to new participants, many of whom would then get involved in a number of subsequent projects. There is a saying in free / open source software, "contribute nothing, expect nothing." Conversely, we could say that the people who contributed something to NiCHE could expect something from us. Some of them contributed articles to a special issue of the journal Environmental History. Some contributed chapters to a new textbook, Method and Meaning in Canadian Environmental History.



Subsequent activities like a summer school and a graduate student workshop brought in some new participants, and brought back many more:





When SSHRC announced a much larger grant for strategic knowledge clusters, we were able to include a version of the last figure as part of our application. (The Graphviz script that generated it is here.)

A year and a half later, we're in the process of scaling up NiCHE activities by a couple of orders of magnitude. Network visualization gives us some insight into the work of a few hundred people who are loosely affiliated with NiCHE and collaborating in many different ways. We can identify people who have energy and initiative to share, and try to help them. Some provide 'bonding capital', tying tightly-linked groups closer together. Some provide 'bridging capital', mobilizing knowledge from one region or disciplinary specialization to another. We can also be more strategic about developing the connections that still need to be made, to make our network stronger and more effective. (For more about social networks, see Clay Shirky's new Here Comes Everybody.)

What is more exciting is that we are getting closer to the point where we can make these kinds of tools available to everyone in NiCHE. People will be able to enter their own information about research collaborations and interests, and explore social connections within the network. It will become much easier to find joint acquaintances to make introductions or to find people with particular skills or expertise.


Tuesday, March 25, 2008

Monitoring the Backchannel

Rob MacDougall and I are putting together something new and fun for Western freshmen this coming fall, a course called "Science, Technology and Global History." Our goals are modest. We hope to cover the history of the whole enchilada from the Big Bang to the near future, while inculcating the idea that historians and scientists both need to have the same kind of critical, evidence-based habits of thought. Forget the two cultures. While Rob is figuring out how our students can work in teams online with students in South Asia, I'm left to kick back and brainstorm classroom mischief.

One of the interesting things about first year courses at our university is that the enrollment can't be capped. So we could have six students or six hundred. I've done large lectures before, and I'm not very enthusiastic about the format. I try to wave my hands a lot, because I once attended a seminar by a psychologist who studies the teaching evaluation process and he said that students rank mobile professors more highly than sessile ones. I also stopped talking every ten minutes or so to give students a chance to ask questions, but most of them seemed pretty shy. Each term, I got to know the half-dozen who did like to speak up in class.

Since I teach with a laptop and LCD projector, I've been thinking it would be fun to have a chat window running so students could provide backchannel commentary that could be seen by all. This might be something like IM or Twitter. As I was talking, I could keep an eye on the chat window and field questions that would take the class somewhere interesting. If there was a sudden storm of confusion, I could go back and unpack or repeat something. Students who read my blog could even try to amuse me by setting loose chatterbots that simulate famous historical figures. Now I suspect that some of you might be worrying that a few students would abuse the system and type obscenities or whatever. But I'm not worried, because I can always walk over to the computer and close the chat window. It's that easy. I figure that if you treat people like adults they respond in kind.

I'd be happy to hear from anyone who has tried something like this.


Sunday, March 23, 2008

A Lunchtime Chat

There is a question that I'm told is popular to ask incoming freshmen: "Which historical figure (Jesus, Gandhi, Ozzy, etc.) would you most like to have lunch with and why?" Now I have no idea what quality in the student this question is supposed to elicit, except perhaps forbearance. I'm glad that no one ever tried it out on me, because most of the answers that occur to me--"Is that likely to happen if I decide to attend this school, sir?"--probably wouldn't help my case. When the list of candidates is specified in advance, they're typically chosen either because they are (in)famous icons of recent pop culture or because they are timeless sages who have already provided written answers to the most common set of meaning-of-life-style questions. As much as I might rather meet Lao Tzu than Elvis, my hunch is that it would be more in keeping with Taoist principles to dine with someone who speaks your language and shares your preference for Southern fried cooking. I could be wrong about that.

The whole dining with the stars thing puts me in mind of the Turing test. Alan Turing famously argued that we'd know that a computer was intelligent when its conversational interaction was indistinguishable from a person's. Because people and computers look different (android fantasies notwithstanding), he suggested a situation that would cloak the embodiment of the interlocutor. The person who is conducting the test takes turns asking questions of two different respondents via a low-bandwidth connection (think IM). If he or she can tell which one is the computer, it fails the Turing test.

In 1966, Joseph Weizenbaum created a conversational program called Eliza. Eliza could read an incoming statement like "I hate dogs" and use simple transformational grammar to turn it into a question: "Why do you hate dogs?" It could offer noncommittal responses like "Please go on." If the person answered a question with "Yes," Eliza might say "You seem positive." Many people interacted with Eliza enthusiastically, leading some to say the Turing Test had already been passed and others to say that it was rubbish. (If you'd like to converse with Eliza you can Google for one of her many incarnations.)
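The three behaviours just described can be imitated in a few lines of Python. This is a toy sketch, not Weizenbaum's program: it swaps in regular-expression pattern matching for his script of transformation rules, and the `RULES` list and `respond` function are invented for this example.

```python
import re

# A few illustrative rules; the original Eliza script was far richer.
RULES = [
    (re.compile(r"^i (hate|love|like|fear) (.+)$", re.IGNORECASE),
     "Why do you {0} {1}?"),
    (re.compile(r"^yes$", re.IGNORECASE), "You seem positive."),
]


def respond(statement):
    """Return a canned transformation of the statement, or a noncommittal reply."""
    text = statement.strip().rstrip(".!")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Slot the captured words into the question template.
            return template.format(*match.groups())
    return "Please go on."


for line in ["I hate dogs", "Yes", "It rained all day."]:
    print(respond(line))
```

Even this much is enough to show how little machinery lies behind the illusion of conversation: anything the rules do not recognize simply falls through to "Please go on."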

If I were chatting with freshmen, say over lunch, I'd be looking for students who had heard of Eliza and the Turing test and had a well-developed sense of anachronism. That hasn't happened to me yet. As a public service, I'm going to offer a new question that has been updated for the digital humanities: "What challenges would you encounter when trying to create an Eliza-style simulation of each of the following historical figures? Which would be most or least likely to pass a Turing test and why?"


Monday, March 10, 2008

Pupation

Every so often in the past few decades I've had to go through my accumulated collections of code and text and binaries and try to translate them so that they could be used on a new platform or new version of an operating system. In some cases, such as text files, it's always been quite easy. In others, it has been more difficult, or even impossible. The assembly language that I wrote for one chip, for example, won't run on any other. The KnowledgeMan database programming that I did in the 1980s dates me, but otherwise isn't of much use now. More poignantly, KMan doesn't even have its own page in Wikipedia. Now I'm in the process of moving all of my files to an open source revision-control system (more on that in a later post) and face many familiar problems. Once again, I'm discovering that open formats are a really good idea, and that in thirty years--if I last that long--the only sources that I will have to look back on my work right now may be text, XML and source code.

As I go through my files this time around, however, there are a lot of notes from writing my dissertation and publishing it. I'm reminded that I've created a few new careers by metabolizing a succession of older ones and metamorphosing into something different. And when I look through my archival notes and book notes and lists of ideas and questions, I see that most of my work didn't end up in the published book. Some of it was tangential, some was forgotten, some better forgotten.

I'm thinking a lot about the computational tools that historians might use to write different kinds of history. In methodological guides, the emphasis is always on keeping track of things, on proper notetaking and proper citation, so that you don't forget where something came from. Working with digitized sources makes it much easier to search and cite and archive, and easier to imagine that almost everything can be saved. But what if some projects are crucially dependent on a period of forgetting and reuse? What kind of tool would allow some sources to be lost, remake your tangents into something new, turn your caterpillar into a butterfly or a moth?


Thursday, March 06, 2008

A Cure for Continuous Partial Attention

On my way home the other night I noticed that the lead story in one of the university student newspapers was headlined "Frustrated profs consider laptop ban." This is one of those perennial favorites. Students seem distracted? Cut off their wireless, ban laptops and smart phones, and forbid internet use for coursework. After all, everyone knows that students always paid respectful attention to their teachers before computer and wireless internet use became widespread. The part of the article that made me laugh the hardest was a quote from an anonymous professor who complained that one student was typing into a laptop furiously for no reason. How hard must that class suck, if the prof thinks that nothing noteworthy was going on? And wouldn't you feel stupid if your inattentive student was brainstorming a cure for cancer? For their part, the students interviewed for the story mostly seemed to think that laptop use was actually helping them to learn and to prepare for their futures.

Really, shouldn't we be worried about the digital divide, rather than trying to exacerbate it? As Manuel Castells argues in The Internet Galaxy, a lack of access to networked devices is only one part of the problem. One of the fundamental challenges for a network society is


the installation of information-processing and knowledge-generation capacity in every one of us--and particularly in every child. By this I obviously do not mean literacy in using the Internet in its evolving forms (this is presupposed). I mean education. But in its broader, fundamental sense; that is, to acquire the intellectual capacity of learning to learn throughout one's whole life, retrieving the information that is digitally stored, recombining it, and using it to produce knowledge for whatever purpose we want. This simple statement calls into question the entire education system developed during the industrial era. (277-78)

A student's freedom to think their own thoughts, to structure their own mental activity, is a far greater good than trying to compel some semblance of attention. So here's a suggestion for all you frustrated profs: relax. I'm guessing that you may have spent some of your own undergraduate hours daydreaming, doodling or writing snarky notes in the margins of your notebooks. And look how well you turned out!