I have a question regarding the particular Naive Bayes algorithm that is used in document


Asked: June 14, 2026 (2026-06-14T07:15:06+00:00)


I have a question regarding the particular Naive Bayes algorithm that is used in document classification. The following is what I understand:

1. Estimate the probability of each word in the training set for each known classification.
2. Given a document, extract all the words that it contains.
3. Multiply together the probabilities of those words being present in a classification.
4. Perform (3) for each classification.
5. Compare the results of (4) and choose the classification with the highest posterior.
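The five steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference implementation: the names `tokenize`, `train` and `classify` are hypothetical, the word probabilities are estimated from document counts, and three details are added that the list does not mention: a class prior, add-one smoothing so that an unseen word does not zero out the product, and summing logarithms instead of multiplying raw probabilities to avoid underflow.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Crude tokenizer (assumption): lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    # Step 1: per class, count documents and how many documents contain each word.
    docs_per_class = Counter()
    doc_freq = defaultdict(Counter)
    for label, text in labeled_docs:
        docs_per_class[label] += 1
        for word in set(tokenize(text)):
            doc_freq[label][word] += 1
    return docs_per_class, doc_freq

def classify(text, docs_per_class, doc_freq):
    total_docs = sum(docs_per_class.values())
    words = set(tokenize(text))  # Step 2: the words the document contains.
    scores = {}
    for label, n_docs in docs_per_class.items():  # Step 4: repeat per class.
        score = math.log(n_docs / total_docs)  # log prior (assumption)
        for word in words:
            # Step 3, in log space, with add-one smoothing (assumption).
            p = (doc_freq[label][word] + 1) / (n_docs + 2)
            score += math.log(p)
        scores[label] = score
    # Step 5: pick the class with the highest score.
    return max(scores, key=scores.get), scores

# Made-up training data, purely for illustration.
training = [
    ("fruit", "I like bananas"),
    ("fruit", "bananas and apples"),
    ("other", "I like cars"),
    ("other", "cars are fast"),
]
docs_per_class, doc_freq = train(training)
label, scores = classify("bananas are tasty", docs_per_class, doc_freq)
print(label)
```

Note that this sketch only scores the words that are present in the document, as in step 3; a full Bernoulli-style model would also account for vocabulary words that are absent.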

What I am confused about is how to calculate the probability of each word given the training set. For example, the word “banana” appears in 100 documents in classification A, there are 200 documents in A in total, and 1000 words appear in A in total. To get the probability of “banana” appearing under classification A, do I use 100/200 = 0.5 or 100/1000 = 0.1?


1 Answer

Editorial Team · Answer 1 · 2026-06-14T07:15:07+00:00

I believe your model will classify more accurately if you count the number of documents the word appears in, not the total number of times the word appears. In other words:

Classify "Mentions Fruit":

"I like Bananas."

should be weighted no more or less than

"Bananas! Bananas! Bananas! I like them."

So the answer to your question would be 100/200 = 0.5.
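To make the document-level estimate concrete, here is a small sketch. The helper names `tokenize` and `word_probability` and the third document are hypothetical, added only for illustration. It counts each document at most once per word, so the two example sentences above contribute equally. Note also that 100/1000 divides a document count by a word count, so it is not a term-frequency estimate either; that would need the number of occurrences of the word, which the question does not give.

```python
import re

def tokenize(text):
    # Crude tokenizer (assumption): lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def word_probability(word, class_docs):
    # Fraction of documents in the class that contain the word at least once.
    containing = sum(1 for doc in class_docs if word in tokenize(doc))
    return containing / len(class_docs)

class_a = [
    "I like Bananas.",
    "Bananas! Bananas! Bananas! I like them.",
    "I like cars.",  # hypothetical extra document without the word
]

p_two_docs = word_probability("bananas", class_a[:2])  # both documents count once
p_all_docs = word_probability("bananas", class_a)      # 2 of 3 documents

# The numbers from the question: 100 of 200 documents contain the word.
p_question = 100 / 200
print(p_two_docs, p_all_docs, p_question)
```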

The description of document classification on Wikipedia also supports my conclusion:

Then the probability that a given document D contains all of the words w_i, given a class C, is

p(D | C) = ∏_i p(w_i | C)

http://en.wikipedia.org/wiki/Naive_Bayes_classifier

In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.

By the way, more advanced classification algorithms examine sequences of N consecutive words (n-grams), not just each word individually, where N can be chosen based on the amount of CPU resources you are willing to dedicate to the calculation.
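As a rough illustration of what word sequences look like as features, the following sketch extracts them from a token list. The function name `ngrams` is hypothetical and the token list is made up; each resulting tuple could then be counted the same way a single word is.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the tokens; each window is one feature.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "like", "ripe", "bananas"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Larger values of N produce many more distinct features, which is where the extra CPU and memory cost comes from.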

UPDATE

My direct experience is based on short documents. I would like to highlight research that @BenAllison points out in the comments, which suggests my answer is invalid for longer documents. Specifically, regarding the binary independence model (BIM), which considers only whether a term is present:

One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.

A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529
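The two weaknesses quoted above can be seen in a small sketch that compares presence/absence features with count features. The helper names and the example documents are hypothetical, chosen only to illustrate the point.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer (assumption): lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def binary_features(text):
    # Presence/absence view: every distinct word maps to 1.
    return {word: 1 for word in set(tokenize(text))}

def count_features(text):
    # Frequency view: every distinct word maps to its number of occurrences.
    return dict(Counter(tokenize(text)))

short_doc = "I like bananas."
long_doc = "Bananas! Bananas! Bananas! I like them, and I eat them daily."

# Frequency information is lost in the binary view.
print(binary_features(long_doc)["bananas"], count_features(long_doc)["bananas"])

# The longer document sets more features to 1, regardless of its class.
print(len(binary_features(short_doc)), len(binary_features(long_doc)))
```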
