In the code below, I read test.txt, put its words into a set, and take the difference between that set and common_words. However, this only removes a single instance of each word in common_words rather than every occurrence. How can I remove ALL instances of the items in common_words from title_words?
from string import punctuation
from operator import itemgetter
N = 10
words = {}
linestring = open('test.txt', 'r').read()
# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))
title = linestring
# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
    print("%s: %d" % (word, frequency))
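One likely reason only a single instance appears to be handled: a set stores each word once, so title_words has already discarded repeated words before the difference is taken, and punctuation is only stripped afterwards, so a token such as "the," is never matched against common_words. The following is a minimal sketch, assuming the goal is a frequency count with the common words excluded. The function name top_keywords and the sample text are hypothetical, and collections.Counter stands in for the manual dictionary count.

```python
from collections import Counter
from string import punctuation

# Same stop words as in the question.
COMMON_WORDS = {"if", "but", "and", "the", "when", "use", "to", "for"}


def top_keywords(text, n=10):
    """Return the n most frequent words in text, skipping COMMON_WORDS."""
    # Work on a stream of words (not a set) so repeated words are still counted.
    words = (w.strip(punctuation) for w in text.lower().split())
    # Filter after stripping punctuation so tokens like "the," are dropped too.
    kept = [w for w in words if w and w not in COMMON_WORDS]
    return Counter(kept).most_common(n)


# Hypothetical sample text standing in for the contents of test.txt.
sample = "Use the force, and the force will use you. The force!"
for word, frequency in top_keywords(sample):
    print("%s: %d" % (word, frequency))
```

Checking each word against the set, instead of converting the whole text to a set, keeps the duplicates that the frequency count depends on while still removing every occurrence of a common word.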
I wrote some code recently that does something similar, although the style is very different from yours. Maybe it will help you out.