I’m pretty new to Python and NLTK but I had a question.
I was writing something to extract only words longer than 7 characters from a self-made corpus, but it turns out that it extracts every word…
Anyone know what I did wrong?
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

loc = r"C:\Users\Dell\Desktop\CORPUS"
Corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')

def long_words(corpus):
    for cat in corpus.categories():
        fileids = corpus.fileids(categories=cat)
        words = corpus.words(fileids)
        long_tokens = []
        words2 = set(words)
        if len(words2) >= 7:
            long_tokens.append(words2)
        print(long_tokens)
Thanks everyone!
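Replace the check if len(words2) >= 7: and the long_tokens.append(words2) line under it with a filter that tests the length of each word in words2 and keeps only the words with at least 7 characters.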
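A minimal sketch of the corrected function, assuming you want the long words grouped per category. The dict return value, the min_len parameter, and the TinyCorpus stand-in (used only so the example runs without your files) are assumptions for illustration, not part of NLTK or of your code:

```python
def long_words(corpus, min_len=7):
    """Return {category: sorted unique words with at least min_len characters}."""
    result = {}
    for cat in corpus.categories():
        fileids = corpus.fileids(categories=cat)
        words = set(corpus.words(fileids))
        # Test the length of each word, not the size of the whole set.
        result[cat] = sorted(w for w in words if len(w) >= min_len)
    return result


class TinyCorpus:
    """Hypothetical stand-in exposing the three reader methods used above."""

    def __init__(self, data):
        # data maps category -> {fileid: [tokens]}
        self._data = data

    def categories(self):
        return sorted(self._data)

    def fileids(self, categories=None):
        return sorted(self._data[categories])

    def words(self, fileids):
        by_file = {f: toks for files in self._data.values() for f, toks in files.items()}
        return [tok for f in fileids for tok in by_file[f]]


demo = TinyCorpus({
    "Milton": {"Milton/lost.txt": ["Of", "Man's", "first", "disobedience"]},
    "Shakespeare": {"Shakespeare/hamlet.txt": ["To", "be", "question", "whether", "nobler"]},
})
print(long_words(demo))
# {'Milton': ['disobedience'], 'Shakespeare': ['question', 'whether']}
```

With your real reader you would call long_words(Corpus) instead. If you want words strictly longer than 7 characters, as in the wording of your question, pass min_len=8.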
Explanation: you were appending the whole set of words (tokens) produced by corpus.words(fileids) whenever the number of words was at least 7 (so, I suppose, always for your corpus). What you really wanted was to filter out the words shorter than 7 characters from the token set and append the remaining long words to long_tokens. Your function should also return the result: the tokens having 7 characters or more. I assume the way you create and use CategorizedPlaintextCorpusReader is OK.
Here is an answer to the question you asked in the comments: