Wednesday, August 23, 2006

Easy Pieces in Python: Removing Stop Words

We continue our exploration of simple Python scripting with another common problem: removing stop words. Roughly forty percent of a typical English text consists of very common words like 'a', 'the', and 'and'. While necessary to convey meaning, these words don't distinguish the text from other texts, and thus are usually not very useful for tasks like searching, indexing, or determining good keywords.

As before, we are working with Charles William Colby's The Fighting Governor: A Chronicle of Frontenac (1915) from Project Gutenberg. We will also be using a list of stop words posted online by computer science researchers at Glasgow. For a specific research project, we would want to tune this list, but it is fine for demo purposes. As before, we read both of our text files into lists of words.

textwords = open('cca0710-trimmed.txt', 'r').read().split()
stopwords = open('stop_words.txt', 'r').read().split()
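The one-liners above never explicitly close the files they open. For a longer script, a slightly more careful version might wrap the read in a small helper. This is only a sketch: the name read_words is hypothetical, it assumes Python 3, and the UTF-8 default is a guess that may need changing for a particular Project Gutenberg file.

```python
def read_words(path, encoding='utf-8'):
    """Read a text file and return its whitespace-separated tokens."""
    # The with-statement closes the file even if reading raises an error.
    with open(path, 'r', encoding=encoding) as f:
        return f.read().split()

# Intended use, with the same file names as above:
#   textwords = read_words('cca0710-trimmed.txt')
#   stopwords = read_words('stop_words.txt')
```

The result is the same kind of list of words as before, so the rest of the code works unchanged.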

Now we have two options. If we want to maintain some semblance of context, we can replace each stop word with a marker. The following code does exactly that.

markeredtext = []
for t in textwords:
    if t.lower() in stopwords:
        markeredtext.append('*')
    else:
        markeredtext.append(t)

Before replacing stop words with markers, a sentence from our file looks like this:

['In', 'short,', 'the', 'divine', 'right', 'of', 'the', 'king', 'to', 'rule', 'over', 'his', 'people', 'was', 'proclaimed', 'as', 'loudly', 'in', 'the', 'colony', 'as', 'in', 'the', 'motherland.']

Afterwards, it looks like this:

['*', 'short,', '*', 'divine', 'right', '*', '*', 'king', '*', 'rule', '*', '*', 'people', '*', 'proclaimed', '*', 'loudly', '*', '*', 'colony', '*', '*', '*', 'motherland.']
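The marker replacement can also be written as a single list comprehension with a conditional expression. The sketch below is self-contained so it can be run without the text files. The function name mark_stop_words and the tiny stop list are made up for illustration; the real list would come from stop_words.txt.

```python
def mark_stop_words(words, stop_words, marker='*'):
    """Return a copy of words with each stop word replaced by marker."""
    return [marker if w.lower() in stop_words else w for w in words]

# Hand-picked stop words for this demo only, not the Glasgow list.
sample_stop_words = ['in', 'the', 'of', 'to', 'over', 'his', 'was', 'as']

sentence = 'In short, the divine right of the king to rule'.split()
marked = mark_stop_words(sentence, sample_stop_words)
print(marked)
# ['*', 'short,', '*', 'divine', 'right', '*', '*', 'king', '*', 'rule']
```

Making the marker a parameter lets us swap '*' for something less likely to occur in the text itself.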

For other applications, we may not need to use markers. To simply delete stop words, we can use the following code instead:

filteredtext = [t for t in textwords if t.lower() not in stopwords]

This is what the same sentence looks like when stop words are deleted:

['short,', 'divine', 'right', 'king', 'rule', 'people', 'proclaimed', 'loudly', 'colony', 'motherland.']
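Testing membership in a list means scanning it for every word of the text, which can get slow for a whole book. One common variation, sketched here, converts the stop words to a set first so that each lookup takes roughly constant time. The function name remove_stop_words and the short stop list are assumptions for the demo, not part of the original script.

```python
def remove_stop_words(words, stop_words):
    """Return words with stop words dropped, matching case-insensitively."""
    # Build the set once; lookups are then constant time on average.
    stop_set = set(stop_words)
    return [w for w in words if w.lower() not in stop_set]

# Hand-picked stop words for this demo only, not the Glasgow list.
sample_stop_words = ['in', 'the', 'of', 'to', 'over', 'his', 'was', 'as']

sentence = ('In short, the divine right of the king to rule over his people '
            'was proclaimed as loudly in the colony as in the motherland.').split()
filtered = remove_stop_words(sentence, sample_stop_words)
print(filtered)
# ['short,', 'divine', 'right', 'king', 'rule', 'people', 'proclaimed', 'loudly', 'colony', 'motherland.']
```

The output matches the list-based version; only the lookup cost changes.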
