We can think of each of these words as a dimension in a high-dimensional space. This is not immediately intuitive, so it helps to work up to the idea with simpler examples. Suppose we have two people, Samuel de Champlain and Sir Samuel Argall. Champlain's biography mentions Montreal, but Argall's does not. If every biography that mentions Montreal is coded with a '1' and every biography that doesn't is coded with a '0', then we have a one-dimensional space. Every biography in this space will be located at either '0' or '1'. In fact, 196 of the 592 biographies in Volume 1 mention Montreal, and the other 396 do not. In some sense, we want to say that the biographies that mention Montreal are closer to one another (at least along that dimension) than they are to the ones that do not.
But what if we take another word into account, like "Jesuit(s)"? We now have two dimensions and four possibilities: a biography might mention both Montreal and Jesuit(s), it might mention Montreal but not Jesuit(s), it might mention Jesuit(s) but not Montreal, or it might mention neither. In fact, Jesuits are mentioned in both Champlain's and Argall's biographies. The biography of John Abraham, however, mentions neither Montreal nor the Jesuits.
In some sense, again, we would like to say that the biography of Champlain is a bit closer to that of Argall (from which it differs along one dimension) than it is to that of Abraham (from which it differs along two dimensions). We would also like to say that Argall's biography is the same distance from each of the other two. This is shown in the following table.
Name | Jesuit(s) | Montreal |
Abraham, John | 0 | 0 |
Argall, Sir Samuel | 1 | 0 |
Champlain, Samuel de | 1 | 1 |
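The comparisons in the table can be checked mechanically. The following is a minimal Python sketch, not the code used for this post; the function name `differing_dimensions` is made up for illustration. It counts the dimensions along which two biographies differ, which is only an informal stand-in for distance at this stage.

```python
from itertools import combinations

# Binary feature vectors copied from the table: (Jesuit(s), Montreal).
biographies = {
    "Abraham, John": (0, 0),
    "Argall, Sir Samuel": (1, 0),
    "Champlain, Samuel de": (1, 1),
}


def differing_dimensions(a, b):
    """Count the dimensions along which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Compare every pair of biographies once.
for (name_a, vec_a), (name_b, vec_b) in combinations(biographies.items(), 2):
    print(f"{name_a} / {name_b}: {differing_dimensions(vec_a, vec_b)}")
```

Argall differs from each of the other two along one dimension, while Abraham and Champlain differ along two.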
We can go on adding words in this fashion, each becoming a dimension along which biographies can differ. (It becomes hard to visualize these dimensions once you have more than three, of course.)
Given the biographies that we already downloaded and the 65 words in our text file, we can write a short hack to create a 65-dimensional feature space and locate each biography within it. We output the results as a spreadsheet.
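To make the idea concrete, here is a rough Python sketch of how such a feature space might be built. It is an illustration under stated assumptions, not the actual hack: the names `build_feature_space` and `write_spreadsheet` are hypothetical, the biography texts are invented placeholders rather than real entries, and a word counts as present if it appears as a whole word (with an optional trailing "s"), ignoring case. Two words stand in for the full list of 65.

```python
import csv
import io
import re


def build_feature_space(texts, words):
    """Map each biography name to a list of 0/1 values, one per word."""
    # Assumption: whole-word, case-insensitive match with an optional plural "s".
    patterns = [
        re.compile(r"\b" + re.escape(word) + r"s?\b", re.IGNORECASE)
        for word in words
    ]
    return {
        name: [1 if pattern.search(text) else 0 for pattern in patterns]
        for name, text in texts.items()
    }


def write_spreadsheet(space, words, out):
    """Write one row per biography, one column per word, as CSV."""
    writer = csv.writer(out)
    writer.writerow(["Name"] + list(words))
    for name in sorted(space):
        writer.writerow([name] + space[name])


# Invented placeholder texts, NOT the real biographies.
texts = {
    "Abraham, John": "Invented placeholder text with neither keyword.",
    "Argall, Sir Samuel": "Invented placeholder text that mentions the Jesuits.",
    "Champlain, Samuel de": "Invented placeholder text that mentions a Jesuit and Montreal.",
}
words = ["jesuit", "montreal"]

space = build_feature_space(texts, words)

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
write_spreadsheet(space, words, buffer)
print(buffer.getvalue())
```

With the full word list, the same loop would give each biography a 65-element row, and the resulting CSV file opens directly in a spreadsheet program.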
Next we will want to formalize the idea of distances in our feature space...