For one of my course projects I started implementing a Naive Bayesian classifier in C.

Asked: May 14, 2026 (2026-05-14T08:40:27+00:00)

For one of my course projects I started implementing a “Naive Bayesian classifier” in C. The project is a document classifier application (mainly for spam) trained on a very large data set.

I now have a problem implementing the algorithm because of the limits of C’s floating-point datatypes.

(The algorithm I am using is described here: http://en.wikipedia.org/wiki/Bayesian_spam_filtering )

PROBLEM STATEMENT:
The algorithm involves taking each word in a document and calculating probability of it being spam word. If p1, p2 p3 …. pn are probabilities of word-1, 2, 3 … n. The probability of doc being spam or not is calculated using

$p = \frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1 - p_1)(1 - p_2) \cdots (1 - p_n)}$

Here, an individual probability can easily be around 0.01, so even with the “double” datatype the products underflow and the calculation breaks down. To confirm this I wrote the sample code below.

#include <stdio.h>

#define PROBABILITY_OF_UNLIKELY_SPAM_WORD     (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD     (0.99)

int main(void)
{
    int index;
    long double numerator = 1.0;              /* product of p_i */
    long double denom1 = 1.0, denom2 = 1.0;   /* products of (1 - p_i) and of p_i */
    long double doc_spam_prob;

    /* Simulating a FEW unlikely spam words */
    for (index = 0; index < 162; index++)
    {
        numerator = numerator*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2    = denom2*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1    = denom1*(long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }
    /* Simulating a lot of almost certain spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2    = denom2*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1    = denom1*(long double)(1 - PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }
    doc_spam_prob = numerator/(denom1 + denom2);
    /* Print the result so that the underflow can be observed */
    printf("doc_spam_prob = %Lg\n", doc_spam_prob);
    return 0;
}

I tried the float, double and even long double datatypes, but the problem remains.

Hence, if in a 100K-word document just 162 words have a 1% spam probability and the remaining 99838 are conspicuously spam words, my app will still classify it as not spam because of the precision error (the numerator easily underflows to zero).

This is the first time I have hit such an issue. How exactly should this problem be tackled?


1 Answer

Editorial Team · Answer 1 · 2026-05-14T08:40:27+00:00

Your problem arises because you are multiplying a large number of terms without regard for their magnitude. One solution is to take logarithms. Another is to sort your individual terms. First, let’s rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i), which follows from dividing the numerator and denominator by ∏p_i. Now your problem is that some of the terms (1-p_i)/p_i are small, while others are big. If you have too many small terms in a row, you’ll underflow, and with too many big terms you’ll overflow the intermediate result.

So, don’t put too many terms of the same order of magnitude in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term and the last the biggest. If you multiplied them in that order you would still underflow, but the order of multiplication doesn’t matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. the smallest term), the other at the end (i.e. the biggest term), and your intermediate result starts at 1.0. Now, when your intermediate result is >= 1.0, you take a term from the front, and when your intermediate result is < 1.0 you take a term from the back.

The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only drift steadily up or down once you run out of small or big terms. But that’s OK. At that point, you’ve consumed the extremes on both ends, so the intermediate result will slowly approach the final result.
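The two-iterator scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name spam_prob_sorted is hypothetical, the per-word probabilities are assumed to be already available in an array of double, and each one is assumed to lie strictly between 0 and 1.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper (not from the original post): returns the document's
   spam probability from per-word probabilities p[0..n-1], using
   1/p = 1 + prod((1 - p_i) / p_i). Returns -1.0 if allocation fails. */
double spam_prob_sorted(const double *p, size_t n)
{
    double *t;
    double prod = 1.0;
    size_t i, lo = 0, hi = n;          /* unused terms are t[lo..hi-1] */

    if (n == 0)
        return 0.5;                    /* empty product is 1, so p = 1/2 */
    t = malloc(n * sizeof *t);
    if (t == NULL)
        return -1.0;

    for (i = 0; i < n; i++) {
        assert(p[i] > 0.0 && p[i] < 1.0);  /* assumption stated above */
        t[i] = (1.0 - p[i]) / p[i];
    }
    qsort(t, n, sizeof *t, cmp_double);

    while (lo < hi) {
        if (prod >= 1.0)
            prod *= t[lo++];           /* take the smallest remaining term */
        else
            prod *= t[--hi];           /* take the largest remaining term */
    }
    free(t);
    return 1.0 / (1.0 + prod);         /* prod == inf yields 0, prod == 0 yields 1 */
}
```

With the question's numbers (162 words at 0.01 and 1000 words at 0.99), the product still underflows in the end, but only after the large terms have been used up, so the returned probability saturates near 1 instead of being computed from an intermediate zero.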

There is of course still a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) itself overflows. So, if the intermediate result overflows, there’s no subsequent loss of precision.
