I’m an NLTK/Python beginner and managed to load my own corpus using CategorizedPlaintextCorpusReader but

Asked: May 28, 2026 (2026-05-28T01:35:20+00:00)

I’m an NLTK/Python beginner. I managed to load my own corpus using CategorizedPlaintextCorpusReader, but how do I actually train a classifier on the data and use it to classify text?

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')
>>> len(reader.categories())
234

1 Answer

Editorial Team · Answer 1 · 2026-05-28T01:35:21+00:00

Assuming you want a naive Bayes classifier with bag-of-words features:

from nltk import FreqDist
from nltk.classify.naivebayes import NaiveBayesClassifier

def make_training_data(rdr):
    """Yield one (bag-of-words FreqDist, category) pair per file."""
    for c in rdr.categories():
        for f in rdr.fileids(c):
            yield FreqDist(rdr.words(fileids=[f])), c

# `reader` is the CategorizedPlaintextCorpusReader from the question.
clf = NaiveBayesClassifier.train(list(make_training_data(reader)))

The resulting clf’s classify method can be used on any FreqDist of words.
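
As a rough sketch of that last step, the snippet below trains on a tiny in-memory data set and then classifies a new string. The toy sentences, the labels, the `classify_text` helper, and the whitespace tokenization are all assumptions made for illustration; with a real corpus you would build the training pairs from the reader and tokenize new text the same way the reader does.

```python
from nltk import FreqDist
from nltk.classify.naivebayes import NaiveBayesClassifier

# Hypothetical toy training set standing in for the corpus reader's output:
# (bag-of-words FreqDist, category label) pairs.
train = [
    (FreqDist("the team won the match".split()), "sports"),
    (FreqDist("the striker scored a goal".split()), "sports"),
    (FreqDist("the bank raised interest rates".split()), "finance"),
    (FreqDist("shares fell as the market closed".split()), "finance"),
]
clf = NaiveBayesClassifier.train(train)

def classify_text(clf, text):
    """Classify raw text; lowercasing and whitespace splitting are assumptions."""
    return clf.classify(FreqDist(text.lower().split()))

label = classify_text(clf, "the striker won the match")
print(label)
```

Note that NLTK's naive Bayes treats each feature value as a discrete symbol, so a word count of 2 is simply a different value from a count of 1, not "twice as much". If that is not what you want, boolean presence features (`{word: True}`) are a common alternative.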

(But note: judging from your cat_pattern, it seems you have one sample and a single category per file in your corpus. Please check whether that’s really what you want.)
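
To see why the pattern from the question gives every file its own category, you can apply it by hand with `re`: the category is capture group 1 of the pattern matched against the start of each file id. The file names and the alternative prefix-based pattern below are hypothetical, since the question does not show the actual contents of `/ebs/category`.

```python
import re

# Hypothetical file ids; the real names in the corpus are not shown in the question.
fileids = ["sports_001.txt", "sports_002.txt", "finance_001.txt"]

def categories_for(pattern, fileids):
    """Return the set of categories a cat_pattern would extract (group 1)."""
    return {re.match(pattern, f).group(1) for f in fileids}

# Pattern from the question: the whole file name (minus .txt) becomes the category.
per_file = categories_for(r'(.*)\.txt', fileids)

# Hypothetical alternative: only the prefix before the first underscore is the category.
per_prefix = categories_for(r'([^_]+)_.*\.txt', fileids)

print(len(per_file), len(per_prefix))
```

With one category per file, the classifier sees exactly one training example per label, which rarely generalizes well. If your file names or directories encode a shared label, a pattern that captures only that part is probably closer to what you need.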
