Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7005609
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 27, 20262026-05-27T21:22:21+00:00 2026-05-27T21:22:21+00:00

EDIT: The correct version of the code which works can be found at :

  • 0

EDIT: The correct version of the code which works can be found at :
https://github.com/a7x/NaiveBayes-Classifier

I used data from openClassroom and started working on a small version of Naive Bayes in Python. Steps were the usual training and then prediction . I have a few questions and want to know why the accuracy is quite bad.

  1. For training, I calculated the log likelihood by the formula :

    log( P ( word | spam ) +1 ) /( spamSize + vocabSize .)

    My question is: why did we add the vocabSize in this case 🙁 and is this the correct way of going about it? Code used is below:

    #This is for training.     Calculate all probabilities and store them in a vector. Better to store it in a file  for easier access 
    from __future__ import division
    import sys,os
    ''' 
    1. The spam and non-spam is already 50%  . So they by default are 0.5
    2. Now we need to calculate probability of each word    , in spam and non-spam separately
      2.1  we can make two dictionaries, defaultdicts basically,  for spam and non-spam 
      2.2 When time comes to calculate probabilities, we just need to substitute values
    '''
    from collections import *
    from math import *
    
    spamDict = defaultdict(int)
    nonspamDict = defaultdict(int)
    spamFolders = ["spam-train"]
    nonspamFolders = ["nonspam-train"]
    path = sys.argv[1] #Base path
    spamVector = open(sys.argv[2],'w') #WRite all spam values into this 
    nonspamVector = open(sys.argv[3],'w') #Non-spam values
    
    #Go through all files in spam and  iteratively add values
    spamSize = 0
    nonspamSize = 0
    vocabSize = 264821
    for f in os.listdir(os.path.join(path,spamFolders[0])):
        data = open(os.path.join(path,spamFolders[0],f),'r')
    
        for line in data:
            words = line.split(" ")
            spamSize = spamSize + len(words)
            for w in words:
                spamDict[w]+=1
    
    for f in os.listdir(os.path.join(path,nonspamFolders[0])):
        data = open(os.path.join(path,nonspamFolders[0],f),'r')
        for line in data:
            words = line.split(" ")
            nonspamSize = nonspamSize + len(words)
            for w in words:
    
                nonspamDict[w]+=1
    logProbspam = {}
    logProbnonSpam = {} #This is to store the log probabilities
    for k in spamDict.keys():
        #Need to calculate P(x | y = 1)
    
        numerator =  spamDict[k] + 1  # Frequency
        print 'Word',k,' frequency',spamDict[k]
        denominator = spamSize + vocabSize
        p = log(numerator/denominator)
        logProbspam[k] = p
    for k in nonspamDict.keys():
        numerator = nonspamDict[k] + 1 #frequency
        denominator = nonspamSize + vocabSize
        p = log(numerator/denominator)
        logProbnonSpam[k] = p
    
    for k in logProbnonSpam.keys():
        nonspamVector.write(k+" "+str(logProbnonSpam[k])+"\n")
    for k in logProbspam.keys():
        spamVector.write(k+" "+str(logProbspam[k])+"\n")
    
  2. For prediction, I just took a mail , split it into words, added all the probabilities, separately for
    spam/non-spam, and multiplied them by 0.5. Whichever was higher was the class label. Code is below:

    http://pastebin.com/8Y6Gm2my ( Stackoverflow was again playing games for some reason :-/)

EDIT : I have removed the spam = spam + 1 thing. Instead, I just ignore those words

Problem : My accuracy is quite bad . As noted below.

    No of files in spam is 130
    No. of spam in  ../NaiveBayes/spam-test  is  53  no. of non-spam 77
    No of files in non-spam is 130
    No. of spam in  ../NaiveBayes/nonspam-test/  is  6  no. of non-spam 124

Please tell me where all I am going wrong. I think an accuracy of less than 50% means there must be some glaring error(s) in the implementation.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-27T21:22:22+00:00Added an answer on May 27, 2026 at 9:22 pm

    There are multiple errors and bad assumptions in your program – in both parts of it. Here are several.

    1. You hardcode into your program the fact that you have the same number of spam and non-spam emails. I’d recommend not to hardcode this assumption. This is not absolutely essential, but in a more general case you will need to remove it.
    2. You hadrcode into your program some number that you treat as the vocabulary size. I’d not recommend doing this as this number may change on any modification of the training set. Furthermore, actually it’s incorrect. I’d recommend to calculate it during learning.
    3. This may be not a mistake, but you seem to have a vocabulary of all the words in the training set. This may be suboptimal; actually the page you refer to recommends to take into account only the top-2500 words over all the emails. However, that’s not essential for obtaining correct results – even without this filtering, my implementation is getting only several emails unclassified.
    4. You incorrectly account for words that have been observed in spam or non-spam only. The log-probability for them being found in the other subset is not 1 which you add, but log(1/(spamSize+vocabSize)) or log(1/(nonspamSize+vocabSize)) depending on its group. This is actually very important – you need to store this probability with your data for the program to function correctly.
    5. You do not ignore words never observed in the training set. Actually these may be treated in different ways, but you should take them into account.
    6. Due to incorrect indentation in the prediction function, you predict using not the whole message, but only the first line of the message. Just a programming bug.

    Update. You have fixed 6. Also 1 is not strictly nessesary to fix while you’re working with this dataset, as well as 3 is not required.
    Your modification did not correctly fix either 4 or 5. First, if the word has never been observed in some set, the probability of the message in it should decrease. Ignoring the word is not a good idea, you need to account for it as a highly unprobable one.
    Second, your current code is asymmetric as the word being absent in spam cancels the check for non-spam (but not the other way). If you need to do nothing in an exception handler, use pass, not continue, as the latter immediately goes to the next for w in words: iteration.
    The problem number 2 is also still in place – the vocabulary size you use does not match the real one. It must be the number of different words observed in the training set, not the total number of words in all the messages together.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

EDIT: The code below has been edited to show the correct solution to the
Suppose you have legacy java code which can not be compiled by an up-to-date
Edit: This question was written in 2008, which was like 3 internet ages ago.
EDIT: This question is more about language engineering than C++ itself. I used C++
EDIT What small things which are too easy to overlook do I need to
Edit: This was accidentally posted twice. Original: VB.NET Importing Classes I've seen some code
EDIT: Here's the video showing the problem: www.youtube.com/watch?v=5ZZsuMH5T9k I'm really stuck here :/ --
Hi there's a live version of the code below (taken from a tutorial) at
Source Code: if (navigator.userAgent.toLowerCase().indexOf('msie') != -1 && parseInt($.browser.version, 10)<7) Edited - sorry, posted wrong
Edit: From another question I provided an answer that has links to a lot

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.