I have a small Python script that prints the 10 most frequent words in a text document (counting only words of 2 letters or more), and I need to extend it to print the 10 most INfrequent words as well. The script mostly works, except that the 10 most infrequent "words" it prints are numbers (integers and floats) when they should be words. How can I iterate over ONLY the words and exclude the numbers? Here is my full script:
# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10
words = {}

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
    if len(word) >= 2:
        counter[word] += 1

top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
EDIT: The end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.
You're going to need a filter. Change the regex to match however you want to define a "word":
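One possible filter is sketched below. The name `is_word` and the pattern itself are assumptions for illustration, not part of the original script: here a "word" is taken to be two or more lowercase ASCII letters, optionally joined by internal apostrophes or hyphens. The script already lowercases and strips punctuation from each token, so the pattern only needs to handle that form.

```python
import re

# Assumed definition of a "word": lowercase ASCII letters, optionally joined
# by internal apostrophes or hyphens (e.g. "don't", "well-known").
# Adjust this pattern to suit your own definition.
WORD_RE = re.compile(r"^[a-z]+(?:['-][a-z]+)*$")


def is_word(token):
    """Return True if token looks like a word rather than a number."""
    return len(token) >= 2 and WORD_RE.match(token) is not None
```

With this pattern, tokens such as `42`, `3.14`, and `x2` are rejected because they contain digits, while ordinary words pass.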
Now, do you want the word frequency table to not include numbers in the first place?
Or do you just want to skip over the numbers when extracting the least-frequent words from the table?
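Either choice can be sketched roughly as follows. The helper names (`is_word`, `count_tokens`, `least_frequent_words`), the `words_only` flag, and the sample text are hypothetical, and the regex is just one assumed definition of a "word". The code avoids Python 2-only syntax, so it should run on Python 2 or 3.

```python
import re
from collections import defaultdict
from string import punctuation

# Assumed definition of a "word": letters with optional internal ' or -.
WORD_RE = re.compile(r"^[a-z]+(?:['-][a-z]+)*$")


def is_word(token):
    return len(token) >= 2 and WORD_RE.match(token) is not None


def count_tokens(text, words_only=False):
    """Build the frequency table; with words_only=True numbers never enter it."""
    counter = defaultdict(int)
    for raw in text.split():
        token = raw.strip(punctuation).lower()
        if len(token) < 2:
            continue
        if words_only and not is_word(token):
            continue
        counter[token] += 1
    return counter


def least_frequent_words(counter, number=10):
    """Skip non-words at extraction time, then sort by (count, word)."""
    items = [(w, c) for w, c in counter.items() if is_word(w)]
    return sorted(items, key=lambda item: (item[1], item[0]))[:number]


sample = "The cat saw 42 cats. The cat paid 3.14 dollars, the 42 dollars."

# Option 1: keep numbers out of the table in the first place.
clean_counter = count_tokens(sample, words_only=True)

# Option 2: count everything, but skip numbers when extracting.
full_counter = count_tokens(sample)

for word, frequency in least_frequent_words(full_counter, 3):
    print("%s: %d" % (word, frequency))
```

Both options give the same least-frequent list. The difference is whether the numbers remain in the table for other uses, such as the most-frequent report.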