I’ve seen a bunch of posts that do basically what I’m doing, but I can’t figure out why I keep getting output that is not what I want. I am trying to increment a dictionary count every time a certain word appears in my Excel file, but as my code currently stands, every instance of a word is treated as a new word. For example, “the” occurs ~50 times in my file, but the output lists “the” on many different lines with a count of “1” for each instance, when I actually want “the” listed once with a count of “50”. I would greatly appreciate any clarification! Here is my code:
import csv
import string

filename = "input.csv"
output = "output1.txt"

def add_word(counts, word):
    word = word.lower()
    #the problem is here, the following line never runs
    if counts.has_key(word):
        counts[word] +=1
    #instead, we always go to the else statement...
    else:
        counts[word] = 1
    return counts

def count_words(text):
    word = text.lower()
    counts = {}
    add_word(counts, word)
    return counts

def main():
    infile = open(filename, "r")
    input_fields = ('name', 'country')
    reader = csv.DictReader(infile, fieldnames = input_fields)
    next(reader)
    first_row = next(reader)
    outfile = open(output, "w")
    outfile.write("%-18s%s\n" %("Word", "Count"))
    for next_row in reader:
        full_name = first_row['name']
        word = text.split(' ',1)[0]
        counts = count_words(word)
        counts_list = counts.items()
        counts_list.sort()
        for word in counts_list:
            outfile.write("%-18s%d\n" %(word[0], word[1]))
        first_row = next_row

if __name__=="__main__":
    main()
Using plain dictionaries, the dict.get method is well suited to counting, because it returns a default value (such as 0) when the key is missing.
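Here is a minimal sketch of that idea, assuming the words have already been pulled out of the file into a list. The `count_with_get` name and the sample `words` list are made up for illustration. Note that the dictionary is created once and shared across all words, rather than being rebuilt for every word:

```python
def count_with_get(words):
    """Tally words in one plain dict that is shared across the whole loop."""
    counts = {}  # created once, outside the loop
    for word in words:
        word = word.lower()
        # get() returns 0 for a word not seen yet, so no membership test is needed
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample data standing in for the first word of each CSV row
words = ["The", "quick", "the", "fox", "THE"]
counts = count_with_get(words)

# Write the report only after every word has been counted
for word, count in sorted(counts.items()):
    print("%-18s%d" % (word, count))
```

Printing the table after the loop has finished, instead of inside it, is what gives one line per distinct word.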
The collections module offers two ways of simplifying this code.
The first is collections.Counter, a dictionary subclass designed for tallying hashable items.
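A short sketch of the Counter version might look like this, again assuming a hypothetical `words` list in place of the values read from the CSV file:

```python
from collections import Counter

# Hypothetical sample data standing in for the first word of each CSV row
words = ["The", "quick", "the", "fox", "THE"]

# Counter consumes any iterable and tallies each distinct item
counts = Counter(word.lower() for word in words)

for word, count in sorted(counts.items()):
    print("%-18s%d" % (word, count))
```

A Counter reports 0 for a word it has never seen, and looking one up does not add it to the tally.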
The second is collections.defaultdict, which creates a missing value on demand by calling a factory function such as int.
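As a rough sketch, with the same hypothetical `words` list and an illustrative `count_with_defaultdict` helper, it could be written as:

```python
from collections import defaultdict


def count_with_defaultdict(words):
    """Tally words using int as the factory for missing keys."""
    counts = defaultdict(int)  # int() with no arguments returns 0
    for word in words:
        # A missing word is first inserted with the value 0, then incremented
        counts[word.lower()] += 1
    return counts


# Hypothetical sample data standing in for the first word of each CSV row
words = ["The", "quick", "the", "fox", "THE"]
counts = count_with_defaultdict(words)

for word, count in sorted(counts.items()):
    print("%-18s%d" % (word, count))
```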
The regular dictionary approach is most suitable when your output needs to be a regular dictionary or when you’re using an older version of Python.
The Counter approach is easy to use and has a number of utilities well suited to counting applications (for example, the most_common() method lists the n biggest counts in sorted order). A backport of Counter is available for versions of Python prior to 2.7.
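To illustrate most_common(), here is a small example on made-up data (the word list is hypothetical, not taken from the question's file):

```python
from collections import Counter

# Hypothetical sample: "the" appears 3 times, "fox" twice, "quick" once
words = ["the", "fox", "the", "quick", "fox", "the"]
counts = Counter(words)

# Ask for the two largest counts, highest first
top_two = counts.most_common(2)
print(top_two)  # [('the', 3), ('fox', 2)]
```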
The defaultdict approach has some disadvantages. Merely accessing a missing value will cause the dictionary to grow. Also, to use it, you need to understand factory functions and know that int() can be called with no arguments to produce a zero value.
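The growth-on-access behaviour can be seen in a few lines. This is only an illustration, and the key names are invented:

```python
from collections import defaultdict

counts = defaultdict(int)
counts["the"] += 1
size_before = len(counts)  # only "the" is stored so far

# A plain lookup of a missing key calls int() and stores the result
value = counts["absent"]
size_after_lookup = len(counts)  # "absent" is now stored with the value 0

# Membership tests and get() do not trigger the factory, so nothing is added
is_present = "other" in counts
fallback = counts.get("other", 0)
size_after_safe_checks = len(counts)

print(size_before, size_after_lookup, size_after_safe_checks)  # 1 2 2
```

If read-only lookups are common, using `in` or `get()` avoids filling the dictionary with zero entries.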