A few days ago I described a way of using Cilibrasi and Vitányi's Normalized Compression Distance (NCD) to automatically cluster bibliographic entries from the online Dictionary of Canadian Biography. A compression algorithm keeps track of redundancies when it is compressing a string. If those redundancies also occur in another string, then the two strings have something in common (i.e., the redundancies). The NCD ranges from 0 (if the two strings are identical) to 1 (if there is absolutely no overlap). Details are in the original article and laid out in one of my earlier posts.
Compression can also be used to automatically refine searches. Suppose you are interested in the explorer Martin Frobisher. If you type "Frobisher" into Yahoo! some of the first few pages of hits are relevant and some are not. Usually you have to wade through the results (or specify more search keywords and hope you don't eliminate something interesting by being too specific.)
An alternate strategy is to enter a broad search keyword (e.g., "Frobisher") and use the NCD to automatically compare the summary that Yahoo! returns for each hit with a "probe" text such as Frobisher's DCB entry. A short Python program to do exactly that is listed here. The search engine results can then be ranked according to increasing NCD from the probe text.
The figure below shows the first 31 of 50 hits for "Frobisher" before and after this search refinement process. I used red font to indicate the irrelevant results. As can be seen, this use of compression and a probe text does a good job of floating the relevant hits to the top of the pile.
Tags: data compression | data mining | Dictionary of Canadian Biography | digital history | Kolmogorov complexity | search | visualization