In my earlier post I selected a handful of entries from the Dictionary of Canadian Biography and submitted them to an open source clustering program that Cilibrasi and Vitányi had provided. I chose the people that I did because I already had some idea of how I would group them myself. The results of the automated clustering were very encouraging, but I hadn't had a chance to follow up with compression-based clustering until now.
This time around, I decided to do a more extensive test. I wrote a Python program to randomly select 100 biographies from volume 1 of the DCB and compute the NCD between each pair. A second Python program combed through the output file of the first to automatically create a Graphviz script that plots all connections below a user-specified threshold. The resulting graph is shown below for NCDs < 0.77.
data:image/s3,"s3://crabby-images/b98c5/b98c508513b0fe3511871bc5c8f33455a25a9dd2" alt=""
Looking at different parts of the figure in turn, it becomes clear that this is a remarkably powerful technique, especially considering that the code is very simple and the algorithm has no domain-specific knowledge whatsoever. It knows nothing about history or about the English language, and yet it is able to find connections among biographical entries that are meaningful to a human interpreter.
One isolated chain of biographies consists of people active in Hudson Bay in the late seventeenth century.
data:image/s3,"s3://crabby-images/a3861/a3861a8319075aceeb87faca02558d4eb8e6a65f" alt=""
A second isolated cluster consists mostly of Englishmen who settled on the coasts of Newfoundland in the early 1600s.
data:image/s3,"s3://crabby-images/26501/2650191720dc8b901d4c9d3fdb44d73114e3f626" alt=""
The main cluster is composed almost entirely of people based in various parts of Quebec in the seventeenth century. There is an arm of Acadian settlers, some of whom spent time in Quebec.
data:image/s3,"s3://crabby-images/4d6e5/4d6e558ce0addfc6ce0579e1307b6e4451f63a2f" alt=""
There is a somewhat puzzling arm that I haven't really figured out.
data:image/s3,"s3://crabby-images/7b024/7b0246b377d8e8e43adb884b71aae552a467488d" alt=""
And there is the main body of the network and a downward projection that are comprised entirely of seventeenth-century Québécois.
data:image/s3,"s3://crabby-images/e3af6/e3af6ea05b78a8c9b821d675c2c5f0c669fad267" alt=""
data:image/s3,"s3://crabby-images/5e436/5e4363a6ddce542956e11951c44c155a50ea206f" alt=""
One benefit of this method is the incredible speed with which it executes. It took only seconds to calculate 4,950 NCDs; clustering the entire DCB would require computing 35,427,153 distance measures and would take less than a day to run on my inexpensive home computer. I'll save that hack for another time.
Tags: data compression | data mining | Dictionary of Canadian Biography | digital history | Kolmogorov complexity | visualization