As before, we are working with Charles William Colby's The Fighting Governor: A Chronicle of Frontenac (1915) from Project Gutenberg. We will also be using a list of stop words posted online by computer science researchers at Glasgow. For a specific research project we would want to tune this list, adding or removing words to suit our sources, but it is fine as it stands for demonstration purposes. As before, we read both of our text files into lists of words.
with open('cca0710-trimmed.txt', 'r') as f:
    textwords = f.read().split()
with open('stop_words.txt', 'r') as f:
    stopwords = f.read().split()
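Testing membership against a list means scanning the list for every word, so for longer texts it may help to convert the stop list to a set, which also makes tuning straightforward. The following is only a sketch: the short list stands in for the contents of stop_words.txt, the name stopset is hypothetical, and the words added and removed are illustrative assumptions rather than recommendations.

```python
# Stand-in for the list read from stop_words.txt (illustrative only).
stopwords = ['the', 'of', 'to', 'in', 'as', 'was', 'his', 'over']

# A set gives fast membership tests and drops duplicate entries.
stopset = set(stopwords)

# Hypothetical tuning for a particular project:
# also treat some archaic forms as stop words ...
stopset |= {'thee', 'thou'}
# ... but keep 'his' if possessives matter to the analysis.
stopset.discard('his')

print('thou' in stopset)
print('his' in stopset)
```

A set can be used anywhere the list is used in a membership test, and the results are the same.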
Now we have two options. If we want to maintain some semblance of context, we can replace each stop word with a marker. The following code does exactly that.
markeredtext = []
for t in textwords:
    if t.lower() in stopwords:
        markeredtext.append('*')
    else:
        markeredtext.append(t)
Before we replace stop words with markers, a sentence from our file looks like this:
['In', 'short,', 'the', 'divine', 'right', 'of', 'the', 'king', 'to', 'rule', 'over', 'his', 'people', 'was', 'proclaimed', 'as', 'loudly', 'in', 'the', 'colony', 'as', 'in', 'the', 'motherland.']
Afterwards, it looks like this:
['*', 'short,', '*', 'divine', 'right', '*', '*', 'king', '*', 'rule', '*', '*', 'people', '*', 'proclaimed', '*', 'loudly', '*', '*', 'colony', '*', '*', '*', 'motherland.']
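Notice that tokens keep their punctuation ('short,' and 'motherland.'), so a stop word followed by a comma or period would not match the list. One possible refinement, sketched here on the assumption that it is acceptable to ignore leading and trailing punctuation when comparing, is to strip it from a lowercased copy of each token. The function name mark_stopwords and the tiny stop list are hypothetical.

```python
import string

def mark_stopwords(words, stopwords, marker='*'):
    """Replace stop words with a marker, ignoring case and edge punctuation."""
    marked = []
    for w in words:
        # Compare a lowercased copy with surrounding punctuation removed.
        core = w.lower().strip(string.punctuation)
        marked.append(marker if core in stopwords else w)
    return marked

# Tiny illustrative stop list, not the full Glasgow list.
sample_stops = {'he', 'what', 'the', 'to'}
sample = ['He', 'knew', 'what', 'the', 'king', 'referred', 'to.']

print(mark_stopwords(sample, sample_stops))
# ['*', 'knew', '*', '*', 'king', 'referred', '*']
```

In this version the marker replaces the whole token, so the period attached to 'to.' is lost. Whether that matters depends on the application.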
For other applications, we may not need markers. To simply delete the stop words, we can use the following code instead:
filteredtext = [t for t in textwords if t.lower() not in stopwords]
This is what the same sentence looks like when stop words are deleted:
['short,', 'divine', 'right', 'king', 'rule', 'people', 'proclaimed', 'loudly', 'colony', 'motherland.']
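As a quick sanity check, we can count how many tokens the filter removes. The sketch below reruns the deletion on the example sentence. To keep it self-contained, it assumes a stop list limited to the stop words that occur in this sentence; the full list is much longer.

```python
sentence = ['In', 'short,', 'the', 'divine', 'right', 'of', 'the', 'king',
            'to', 'rule', 'over', 'his', 'people', 'was', 'proclaimed', 'as',
            'loudly', 'in', 'the', 'colony', 'as', 'in', 'the', 'motherland.']

# Only the stop words that occur in this sentence (illustrative subset).
sample_stops = {'in', 'the', 'of', 'to', 'over', 'his', 'was', 'as'}

filtered = [t for t in sentence if t.lower() not in sample_stops]
removed = len(sentence) - len(filtered)

print(filtered)
print(removed, 'of', len(sentence), 'tokens removed')
# 14 of 24 tokens removed
```

More than half of the tokens in this sentence are stop words, which gives a sense of how much a stop list can shrink a text.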