I’m pretty new to Python and NLTK but I had a question. I was

Question


Asked: June 17, 2026 (2026-06-17T16:07:14+00:00)


I’m pretty new to Python and NLTK but I had a question.
I was writing something to extract only the words longer than 7 characters from a self-made corpus, but it turns out that it extracts every word…
Anyone know what I did wrong?

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

loc = r"C:\Users\Dell\Desktop\CORPUS"
Corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')

def long_words(corpus):
    for cat in corpus.categories():
        fileids = corpus.fileids(categories=cat)
        words = corpus.words(fileids)
        long_tokens = []
        words2 = set(words)
        if len(words2) >= 7:
            long_tokens.append(words2)
        print(long_tokens)

long_words(Corpus)

Thanks everyone!


1 Answer

Editorial Team · Answer 1 · 2026-06-17T16:07:15+00:00

Replace

if len(words2) >=7:
    long_tokens.append(words2)

with:

long_tokens += [w for w in words2 if len(w) >= 7]

Explanation: len(words2) is the number of distinct tokens produced by corpus.words(fileids), not the length of any single word. So you were appending the whole set of tokens whenever it contained at least 7 of them, which is presumably always the case for your corpus. What you actually want is to drop the tokens shorter than 7 characters and add the remaining long words to long_tokens. If you want words strictly longer than 7 characters, as your description says, use len(w) > 7 instead.
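To make the difference concrete, here is a minimal sketch that runs without NLTK. The token list is made up and only stands in for what corpus.words(fileids) would return; the names buggy and distinct_count are introduced here for illustration.

```python
# Hypothetical stand-in for corpus.words(fileids); not real corpus data.
words = ["shall", "compare", "thee", "summer", "lovely",
         "temperate", "darling", "winds", "compare"]
words2 = set(words)

# Original test: len(words2) counts distinct tokens, so the condition
# is true for any corpus with 7 or more different words, and the whole
# set is appended as a single element.
distinct_count = len(words2)
buggy = []
if len(words2) >= 7:
    buggy.append(words2)

# Fixed test: len(w) is the number of characters in one token.
long_tokens = sorted(w for w in words2 if len(w) >= 7)

print(distinct_count)
print(long_tokens)
```

The original condition never looks at an individual word, which is why every word ended up in the result.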

Your function should also return its result, the tokens with 7 or more characters, and long_tokens should be created once before the loop so that it is not reset for every category. I assume the way you create and use CategorizedPlaintextCorpusReader is OK:

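from nltk.corpus.reader import CategorizedPlaintextCorpusReader

loc = r"C:\Users\Dell\Desktop\CORPUS"  # raw string keeps the backslashes literal
Corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')

def long_words(corpus=Corpus):
    long_tokens = []  # created once, so results accumulate across categories
    for cat in corpus.categories():
        fileids = corpus.fileids(categories=cat)
        words = corpus.words(fileids)
        # keep only the distinct tokens with at least 7 characters
        long_tokens += [w for w in set(words) if len(w) >= 7]
    return set(long_tokens)

print("\n".join(long_words()))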

Here is an answer to the question you asked in the comments:

for loc in ['cat1', 'cat2']:
    corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')
    print("%d words of 7+ characters in %s" % (len(long_words(corpus=corpus)), loc))
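If a count per category is wanted rather than one per corpus directory, the same filter can be applied category by category. The sketch below is only an illustration: count_long_words, tokens_by_category, min_len and the sample tokens are hypothetical, and a plain dict stands in for the corpus reader so the example runs without any files on disk.

```python
def count_long_words(tokens_by_category, min_len=7):
    """Map each category to its number of distinct tokens with at least min_len characters."""
    return {
        cat: len({w for w in tokens if len(w) >= min_len})
        for cat, tokens in tokens_by_category.items()
    }

# Made-up sample data; with a real reader, each list would come from
# something like corpus.words(categories=cat).
sample = {
    "Shakespeare": ["tomorrow", "and", "tomorrow", "creeps", "syllable"],
    "Milton": ["paradise", "lost", "disobedience", "forbidden", "fruit"],
}

counts = count_long_words(sample)
for cat, n in sorted(counts.items()):
    print("%d words of 7+ characters in %s" % (n, cat))
```

Using a set comprehension inside len() counts each long word once per category, which matches the set-based deduplication in long_words.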
