Saturday, April 29, 2006

SIP Mapping

In an earlier post, I mentioned that Amazon keeps track of phrases that are distinctive to a small set of books. These SIPs (statistically improbable phrases) give some idea of the conceptual landscape in and around particular works, and so can be used to generate bibliographies. Ideally, of course, the process could be automated. If machine-readable versions of the books were available, the same approach could also be used as part of a text mining project.

I haven't had a chance to do much programming recently, so I thought I would put together a rudimentary hack to scrape SIPs and create a map. I also wanted to learn how to use the open source Graphviz visualization toolkit, so I used a Perl module to interface with it. If you look at the code for the hack, you can see how simple it is to create pretty neat graphs. The figure below (1 MB) shows what happens when you start with Diamond's Guns, Germs, and Steel and follow the SIPs to adjacent books. The figure is more than 8,000 pixels wide, so you have to zoom in to see the detail ... and at that level it is pretty complicated. I will leave the implementation of a better graph browser for a future hack.
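
To give a flavor of the graph-building step, here is a minimal sketch in Python. It is not the Perl hack itself and it does no scraping: it assumes the SIPs have already been collected into a table of book titles and phrases. The function names (`quote`, `sip_map_to_dot`), the `max_depth` parameter, and the sample data are all made up for illustration, and the phrases are placeholders rather than real Amazon SIPs. The sketch walks outward from a starting book through shared phrases and writes the result as Graphviz DOT text.

```python
from collections import deque


def quote(label):
    """Wrap a label in double quotes, escaping embedded double quotes for DOT."""
    return '"' + label.replace('"', '\\"') + '"'


def sip_map_to_dot(start, sips_by_book, max_depth=1):
    """Walk outward from `start` through shared phrases and return DOT text."""
    # Invert the table so each phrase lists the books it appears in.
    books_by_sip = {}
    for book, sips in sips_by_book.items():
        for sip in sips:
            books_by_sip.setdefault(sip, []).append(book)

    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        book, depth = queue.popleft()
        for sip in sips_by_book.get(book, []):
            edges.append((book, sip))
            if depth >= max_depth:
                continue  # draw this book's phrases but do not follow them
            for other in books_by_sip[sip]:
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))

    lines = ["graph sip_map {", "  node [shape=ellipse];"]
    # Books are drawn as boxes; phrases keep the default ellipse.
    lines += [f"  {quote(book)} [shape=box];" for book in sorted(seen)]
    lines += [f"  {quote(book)} -- {quote(sip)};" for book, sip in edges]
    lines.append("}")
    return "\n".join(lines)


# Placeholder data for illustration only, not real Amazon SIPs.
SAMPLE = {
    "Book A": ["phrase one", "phrase two"],
    "Book B": ["phrase two", "phrase three"],
    "Book C": ["phrase three"],
}

if __name__ == "__main__":
    print(sip_map_to_dot("Book A", SAMPLE))
```

Saving the output to a file such as sip_map.dot and running `dot -Tpng sip_map.dot -o sip_map.png` should render it. Raising `max_depth` follows the phrases further from the starting book, which is also how a map can balloon to thousands of pixels.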


