The Archive Base

Editorial Team
Asked: June 14, 2026

I have a question regarding the particular Naive Bayes algorithm that is used in document classification


I have a question regarding the particular Naive Bayes algorithm that is used in document classification. The following is what I understand:

  1. Estimate a probability for each word in the training set, for each known classification.
  2. Given a document, extract all the words that it contains.
  3. Multiply together the probabilities of those words being present in a classification.
  4. Perform (3) for each classification.
  5. Compare the results of (4) and choose the classification with the highest posterior.
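The five steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `tokenize` helper, the add-one smoothing, the class prior, and the use of log probabilities are all assumptions that do not appear in the steps themselves, and the sketch estimates each word probability from the number of documents that contain the word.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Step 2: return the set of distinct lowercase words in a document."""
    return set(re.findall(r"[a-z']+", text.lower()))

def train(labeled_docs):
    """Step 1: count, per class, how many documents contain each word."""
    docs_per_class = Counter()
    doc_freq = defaultdict(Counter)  # class -> word -> number of docs containing it
    for text, label in labeled_docs:
        docs_per_class[label] += 1
        doc_freq[label].update(tokenize(text))
    return docs_per_class, doc_freq

def word_prob(word, label, docs_per_class, doc_freq):
    # Add-one smoothing (an assumption) keeps unseen words from zeroing the product.
    return (doc_freq[label][word] + 1) / (docs_per_class[label] + 2)

def classify(text, docs_per_class, doc_freq):
    """Steps 3-5: score every class and return the one with the highest score."""
    total_docs = sum(docs_per_class.values())
    scores = {}
    for label in docs_per_class:
        # Summing logs is equivalent to multiplying probabilities, without underflow.
        score = math.log(docs_per_class[label] / total_docs)  # class prior
        for word in tokenize(text):
            score += math.log(word_prob(word, label, docs_per_class, doc_freq))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set, for illustration only.
training = [
    ("I like bananas", "fruit"),
    ("Apples and bananas are tasty", "fruit"),
    ("The car needs new tires", "other"),
    ("My car is fast", "other"),
]
docs_per_class, doc_freq = train(training)
print(classify("bananas are tasty", docs_per_class, doc_freq))
```

Like the steps it follows, the sketch scores only the words that are present in the document and ignores the words that are absent.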

What I am confused about is how to calculate the probability of each word given the training set. For example, suppose the word “banana” appears in 100 documents in classification A, there are 200 documents in A in total, and 1000 words appear in A in total. To get the probability of “banana” appearing under classification A, do I use 100/200 = 0.5 or 100/1000 = 0.1?


1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 7:15 am

    I believe your model will classify more accurately if you count the number of documents the word appears in, not the total number of times the word appears. In other words:

    Classify "Mentions Fruit":

    "I like Bananas."

    should be weighted no more or less than

    "Bananas! Bananas! Bananas! I like them."

    So the answer to your question would be 100/200 = 0.5.
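As a rough illustration of this document-count estimate, the sketch below treats each document as a set of distinct words, so repeated mentions collapse into a single "present" flag, and then plugs in the numbers from the question. The `distinct_words` helper and the variable names are assumptions made for the example. Note that the alternative, 100/1000, divides a document count by a word count; a frequency-based estimate would need the total number of occurrences of the word, not the number of documents containing it.

```python
import re

def distinct_words(text):
    """Return the set of distinct lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

short_doc = "I like Bananas."
long_doc = "Bananas! Bananas! Bananas! I like them."

# With presence features, both documents contribute the same evidence for "bananas".
short_has_bananas = "bananas" in distinct_words(short_doc)
long_has_bananas = "bananas" in distinct_words(long_doc)
print(short_has_bananas, long_has_bananas)

# Numbers from the question: 100 of the 200 documents in class A contain "banana".
docs_with_banana = 100
docs_in_class_a = 200
p_banana_given_a = docs_with_banana / docs_in_class_a
print(p_banana_given_a)  # 0.5
```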

    The description of document classification on Wikipedia also supports my conclusion:

    Then the probability that a given document D contains all of the words w_i, given a class C, is

    p(D | C) = ∏_i p(w_i | C)

    http://en.wikipedia.org/wiki/Naive_Bayes_classifier

    In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.

    By the way, more advanced classification algorithms examine sequences of N words (N-grams), not just each word individually, where N can be chosen based on the amount of CPU resources you are willing to dedicate to the calculation.
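For illustration, sequences of N consecutive words can be extracted along these lines. This is only a sketch; the `ngrams` helper is a hypothetical name, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i like ripe bananas".split()
print(ngrams(tokens, 2))
# [('i', 'like'), ('like', 'ripe'), ('ripe', 'bananas')]
```

Each tuple can then be counted as a feature in the same way a single word would be. Larger N produces many more distinct features, which is where the extra computation comes from.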

    UPDATE

    My direct experience is based on short documents. I would like to highlight research that @BenAllison pointed out in the comments, which suggests that my answer is invalid for longer documents. Specifically:

    One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.

    A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529
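To make the contrast concrete, here is a small sketch comparing a presence-based estimate with a frequency-based one on a toy class. The three documents, the `words` helper, and the variable names are assumptions made up for the example, and smoothing is left out to keep the arithmetic visible.

```python
import re
from collections import Counter

def words(text):
    """Return every lowercase word token in the text, repeats included."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical documents belonging to one class.
class_a_docs = [
    "I like Bananas.",
    "Bananas! Bananas! Bananas! I like them.",
    "I like apples.",
]

# Presence-based: fraction of documents that contain the word at least once.
docs_containing = sum("bananas" in set(words(d)) for d in class_a_docs)
presence_estimate = docs_containing / len(class_a_docs)

# Frequency-based: fraction of all word tokens in the class that are the word.
token_counts = Counter(w for d in class_a_docs for w in words(d))
frequency_estimate = token_counts["bananas"] / sum(token_counts.values())

print(presence_estimate)   # 2 of 3 documents
print(frequency_estimate)  # 4 of 12 tokens
```

Under the frequency-based estimate, the second document contributes three times as much evidence for "bananas" as the first, which is the information the quoted passage says a presence-only model discards.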
