The basic problem is to split a text file into an array of words, count the number of occurrences of each word, and return a dictionary sorted by frequency. For my text, I chose Charles William Colby's The Fighting Governor: A Chronicle of Frontenac (1915), available from Project Gutenberg. We start by reading the file into one long string and then use whitespace to split the string into a list of separate words. In Python it looks like this:
infile = open('cca0710-trimmed.txt', 'r')
text = infile.read()
wordlist = text.split()
or like this if you want to show off:
wordlist = open('cca0710-trimmed.txt', 'r').read().split()
Now that we have our word list, the next step is to create the dictionary. We do this by first counting the number of occurrences of each word in the list:
wordfreq = [wordlist.count(p) for p in wordlist]
Then we pair each word with its corresponding frequency to create the dictionary:
dictionary = dict(zip(wordlist,wordfreq))
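As an aside, the count() comprehension above rescans the entire word list once per word, which is quadratic and gets slow on long texts. If you are on Python 2.7 or newer, the standard library's collections.Counter builds the same word-to-frequency mapping in a single pass; a minimal sketch:
from collections import Counter
dictionary = Counter(wordlist)  # one pass; acts like a dict of word -> count
Since Counter is a dict subclass, the sorting code below works on it unchanged (and its most_common() method will even do the frequency sort for you).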
Now that we have the dictionary, we can sort it in descending order of word frequency and print out the results:
aux = [(dictionary[key], key) for key in dictionary]
aux.sort()
aux.reverse()
for a in aux: print(a)
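Incidentally, calling sort() and then reverse() is just a manual descending sort; the built-in sorted() with reverse=True produces the identical list in one step, so the three lines above could also be written as:
aux = sorted([(dictionary[key], key) for key in dictionary], reverse=True)
for a in aux: print(a)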
This gives us results like the following:
(2574, 'the')
(1394, 'of')
(880, 'to')
(855, 'and')
(572, 'in')
(548, 'was')
(545, 'a')
(420, 'his')
...
(213, 'for')
(212, 'Frontenac')
(209, 'by')
(194, 'not')
...
(76, 'would')
(75, 'Iroquois')
(74, 'upon')
...
(68, 'English')
(68, 'Canada')
(66, 'New')
(65, 'France')
...
Not too hard, eh?
Tags: programming | python | statistical natural language processing | text mining