When doing a word frequency count on my corpus, the results seem inaccurate (are

Question

Asked: June 17, 2026

When doing a word frequency count on my corpus, the results seem inaccurate: they do not look like the most frequent words to me, and each frequency count is only one or two. Some results also show ‘as over\xe2’ and ‘\xad’. Can anyone help?

def toptenwords(mycorpus):
    mywords = mycorpus.words()
    nocapitals = [word.lower() for word in mywords]
    filtered = [word for word in nocapitals if word not in stoplist]
    nopunctuation = [s.translate(None, string.punctuation) for s in filtered]
    woordcounter = {}
    for word in nopunctuation:
        if word in woordcounter:
            woordcounter[word] += 1
        else:
            woordcounter[word] = 1
    frequentwords = sorted(woordcounter.iteritems(), key = itemgetter(1), reverse = True)
    top10 = frequentwords[:10]
    woord1 = frequentwords[1]
    woord2 = frequentwords[2]
    woord3 = frequentwords[3]
    woord4 = frequentwords[4]
    woord5 = frequentwords[5]
    woord6 = frequentwords[6]
    woord7 = frequentwords[7]
    woord8 = frequentwords[8]
    woord9 = frequentwords[9]
    woord10 = frequentwords[10]
    print "De 10 meest frequente woorden zijn: ", woord1, ",", woord2, ",", woord3, ",",    woord4, ",", woord5, ",", woord6, ",", woord7, ",", woord8, ",", woord9, "en", woord10

The code was originally written in Dutch; this is the untranslated version:

def toptienwoorden(mycorpus):
   woorden = mycorpus.words()
   zonderhoofdletters = [word.lower() for word in woorden] 
   gefiltered = [word for word in zonderhoofdletters if word not in stoplijst] 
   geenleestekens = [s.translate(None, string.punctuation) for s in gefiltered] 
   woordteller = {}
   for word in geenleestekens:
      if word in woordteller:
         woordteller[word] += 1
      else:
         woordteller[word] = 1
   frequentewoorden = sorted(woordteller.iteritems(), key = itemgetter(1), reverse = True)
   top10 = frequentewoorden[:10]
   woord1 = frequentewoorden[1]
   woord2 = frequentewoorden[2]
   woord3 = frequentewoorden[3]
   woord4 = frequentewoorden[4]
   woord5 = frequentewoorden[5]
   woord6 = frequentewoorden[6]
   woord7 = frequentewoorden[7]
   woord8 = frequentewoorden[8]
   woord9 = frequentewoorden[9]
   woord10 = frequentewoorden[10]
   print "De 10 meest frequente woorden zijn: ", woord1, ",", woord2, ",", woord3, ",", woord4, ",", woord5, ",", woord6, ",", woord7, ",", woord8, ",", woord9, "en", woord10


1 Answer

Editorial Team · Answer 1 · 2026-06-17T20:45:49+00:00

Use a collections.Counter. It is perfect for counting the frequency of (hashable) items and it has a most_common method which can return the top ten most frequent items without you having to code the logic yourself:

import string
import collections


def topNwords(mywords, N = 10, stoplist = set()):
    # mywords = mycorpus.words()
    # Lowercase first so "The" and "the" are counted as one word.
    nocapitals = [word.lower() for word in mywords]
    filtered = [word for word in nocapitals if word not in stoplist]
    # Python 2 form of str.translate: delete every punctuation character.
    nopunctuation = [s.translate(None, string.punctuation) for s in filtered]
    woordcounter = collections.Counter(nopunctuation)
    # most_common(N) returns the N highest (word, count) pairs; keep only the words.
    top_ten = [word for word, freq in woordcounter.most_common(N)]
    return top_ten


top_ten = topNwords('This is a test. It is only a test. In case of a real emergency'.split(), N = 10)
print("De 10 meest frequente woorden zijn: {w}".format(w = ', '.join(top_ten)))
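
The code above relies on the Python 2 form of str.translate. For anyone on Python 3, here is a rough sketch of the same idea, not a drop-in replacement. The name top_n_words is hypothetical, and decoding byte tokens as UTF-8 is an assumption about the corpus; stray fragments such as ‘\xe2’ and ‘\xad’ often come from multi-byte characters that were never decoded, but that has not been verified for this corpus. The sketch also applies the stoplist after stripping punctuation and skips tokens that end up empty, which differs slightly from the function above.

```python
import collections
import string

# Translation table that deletes every ASCII punctuation character (Python 3 form).
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def top_n_words(tokens, n=10, stoplist=frozenset()):
    """Return the n most common (word, count) pairs in tokens."""
    cleaned = []
    for token in tokens:
        # Tokens read as raw bytes are decoded first; UTF-8 is an assumption here.
        if isinstance(token, bytes):
            token = token.decode("utf-8", errors="replace")
        word = token.lower().translate(_STRIP_PUNCT)
        # Skip tokens that were pure punctuation, and apply the stoplist
        # after stripping so that "test." and "test" are treated alike.
        if word and word not in stoplist:
            cleaned.append(word)
    return collections.Counter(cleaned).most_common(n)


sample = "This is a test. It is only a test. In case of a real emergency".split()
for word, count in top_n_words(sample, n=3):
    print(word, count)
```

Returning the (word, count) pairs rather than only the words makes it easy to check whether the counts themselves look plausible, which was part of the original complaint.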
