We've been using the digital history reading list as the basis for a series of hacks that make use of some of the features of the Amazon API to explore the conceptual space around a sample bibliography. In the first post in the series, we spidered customer recommendations to find other books that might be of interest. In the second post, we visualized the network of recommendations using the freely available Graphviz package. One of the things that we noticed was that there was a tight cluster of digital humanities 'classics' published between 1991 and 2002, bound together by recommendations but not linked into the larger network. This suggested that we might find temporal strata in recommendations ... that is, that books of a particular era might be linked to one another by customer recommendations, but not linked to books published much earlier or later.
Today we will explore that hypothesis a bit further. We don't have nearly enough data to make any claims about customer recommendations in general, but this is supposed to be exploratory hacking. We're looking for phenomena that might be of interest, for studies we might want to undertake later on large data sets.
As every programmer knows, most of the trick to solving a problem algorithmically is to represent the data in a way that makes the answer easy to find. At the conclusion of our last hack, we had a long list of Amazon Standard Identification Number (ASIN) pairs representing recommendations: "if you liked that, you'll like this." Now what we need to do is to submit each ASIN to the Amazon API and get the publication date. Then we will have a long list of date pairs: "if you liked that book published in year x, you'll like this book published in year y." We then transform those pairs into a matrix, with one row for each year x and one column for each year y, where each cell counts the recommendations for that pair of years. Some quick-and-dirty Python source to do most of the work is here.
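The pairs-to-matrix step can be sketched in a few lines. This is a minimal illustration and not the script linked above: the function name `build_year_matrix` and the sample year pairs are made up for the example, and we assume the publication years have already been looked up.

```python
from collections import Counter

def build_year_matrix(year_pairs):
    """Turn (source_year, recommended_year) pairs into a square count matrix.

    Rows are the publication year of the book we started from; columns are
    the publication year of the recommended book.
    """
    counts = Counter(year_pairs)
    # Every year that appears on either side of a pair gets a row and a column.
    years = sorted({year for pair in year_pairs for year in pair})
    matrix = [[counts[(src, rec)] for rec in years] for src in years]
    return years, matrix

# Hypothetical sample data, not taken from the reading list.
pairs = [(1991, 1995), (1991, 2002), (1995, 1991),
         (2006, 2006), (2006, 2005), (2006, 2006)]

years, matrix = build_year_matrix(pairs)
for year, row in zip(years, matrix):
    print(year, row)
```

With the years sorted on both axes, the same-year case falls on the diagonal, which makes the earlier/later split easy to read off each row.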
Now, for any given year that books on our list were published, we can see how many recommendations were for books published earlier, how many for books published the same year, and how many for books published later. If you look at the figure above, I've put boxes around the diagonal, which represents the case where both books were published the same year. In each row, everything to the left of the boxed cell is a recommendation for a book published earlier; everything to the right is a recommendation for a book published later.
Without generalizing too much, it appears that the recommendations for books published earlier tend to be more spread out in time than the ones for books published more recently. This may reflect the fact that the older books have had more time to take their place in the literature, or it may be due to the fact that Amazon hasn't been collecting data for very long and has been growing during that time, or to some other factors.
We want to be a bit careful about comparing different rows in the matrix, because most of the books that are on the digital history reading list were published relatively recently (many in 2006). So we can't look at the blob of color in the lower right-hand corner, for example, and conclude that most recommendations are to recent books. Instead, what we have to do is normalize our data: for each year, we determine what proportion of the recommended books were published earlier than, the same year as, and later than that year. That is shown in the figure below.
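One way to do that normalization is sketched here. It is an illustration under assumptions, not the code used to draw the figure: the function name `direction_proportions` and the sample year pairs are hypothetical.

```python
def direction_proportions(year_pairs):
    """For each source year, return the share of recommendations that point
    to books published earlier, the same year, or later."""
    tallies = {}
    for src, rec in year_pairs:
        row = tallies.setdefault(src, {"earlier": 0, "same": 0, "later": 0})
        if rec < src:
            row["earlier"] += 1
        elif rec == src:
            row["same"] += 1
        else:
            row["later"] += 1

    proportions = {}
    for src, row in sorted(tallies.items()):
        # Every year in tallies has at least one pair, so total is never zero.
        total = sum(row.values())
        proportions[src] = {kind: n / total for kind, n in row.items()}
    return proportions

# Hypothetical sample data, not taken from the reading list.
pairs = [(1991, 1995), (1991, 2002), (1995, 1991),
         (2006, 2006), (2006, 2005), (2006, 2006), (2006, 2004)]

props = direction_proportions(pairs)
for year, shares in props.items():
    print(year, shares)
```

Dividing by each year's own total is what lets us compare a year with two recommendations against a year with dozens.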
It appears that recommendations for earlier books tend to point to books published later, and recommendations for later books tend to point to books published earlier. Finally, it appears that a growing proportion of recommendations are to books published the same year. The fact that the year 2001 seems to be an exception to this trend may be worth investigating.
Tags: Amazon | application program interface | bibliography | hacking | python | visualization