I’m using a 1.6M tweet corpus to train a naive bayes sentiment engine.


Asked: 2026-05-25T20:45:21+00:00


I’m using a 1.6M tweet corpus to train a naive bayes sentiment engine.

I have two dictionaries of n-grams (Dictionary<string,int>, where the string is the n-gram and the int is the number of occurrences of that n-gram in my corpus). The first dictionary is built from the positive tweets, the second from the negative tweets. In an article on this subject, the authors discard common n-grams (i.e. n-grams that do not strongly indicate any sentiment nor indicate objectivity of a sentence. Such n-grams appear evenly in all datasets). I understand this quite well conceptually, but the formula they provide is rooted in mathematics, not code, and I’m not able to decipher what I’m supposed to be doing.

I have spent the last few hours searching the web for how to do this. I’ve found examples of entropy calculation for search engines, which usually compute the entropy of a single string, and the most common code block has been ShannonsEntropy.

I’m also relatively new to this space, so I’m sure my ignorance is playing a bit of a part in this, but I’m hoping somebody on SO can help nudge me in the right direction. To summarize:

Given two Dictionaries, PosDictionary & NegDictionary, how do I calculate the entropy of identical n-grams?

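Pseudo-code is fine, and I imagine it looks something like this:

const double theta = 0.80; // example threshold
// Snapshot the keys: a Dictionary cannot be modified while it is being enumerated.
foreach (string myNGram in PosDictionary.Keys.ToList()) { // requires using System.Linq;
    if (NegDictionary.ContainsKey(myNGram)) {
        double result = CalculateEntropyOfNGram(myNGram);
        if (result > theta) {
            PosDictionary.Remove(myNGram);
            NegDictionary.Remove(myNGram);
        }
    }
}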

I think that’s the process I’ll need to take. What I don’t know is what the CalculateEntropyOfNGram function looks like…

(Edit)
Here is the link to the pdf used to describe the entropy/salience process (section 5.3)


1 Answer

Editorial Team · Answer 1 · 2026-05-25T20:45:22+00:00


Equation (10) in the paper gives the definition. If you have trouble reading the equation, it is shorthand for

    H(..) = -log(p(S1|g)) * p(S1|g)  - log(p(S2|g)) * p(S2|g) - ....
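For the two-dictionary setup in the question, the sum above has only two terms (positive and negative). The following is a minimal C# sketch, not the paper's reference implementation. It assumes that p(S|g) can be estimated from the raw counts as count of g in S divided by the total count of g, which is only reasonable if the positive and negative corpora are about the same size. It also assumes base-2 logarithms, so that with two classes the result lies between 0 and 1 and a threshold such as 0.80 is meaningful. The class name `NGramEntropy`, the helper `RemoveCommonNGrams`, and the choice to pass the two counts rather than the n-gram string are all illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NGramEntropy
{
    // Shannon entropy (in bits) of the sentiment distribution of one n-gram.
    // Assumption: p(S|g) is estimated as (count of g in S) / (total count of g).
    public static double CalculateEntropyOfNGram(int posCount, int negCount)
    {
        double total = posCount + negCount;
        if (total == 0) return 0.0;

        double entropy = 0.0;
        foreach (int count in new[] { posCount, negCount })
        {
            if (count == 0) continue;        // the term 0 * log(0) is treated as 0
            double p = count / total;
            entropy -= p * Math.Log(p, 2);   // base 2: result is in [0, 1] for two classes
        }
        return entropy;
    }

    // Removes from both dictionaries every shared n-gram whose entropy exceeds theta.
    public static void RemoveCommonNGrams(
        Dictionary<string, int> pos, Dictionary<string, int> neg, double theta)
    {
        // Snapshot the keys first: a Dictionary must not be modified while enumerating it.
        foreach (string nGram in pos.Keys.ToList())
        {
            if (neg.TryGetValue(nGram, out int negCount) &&
                CalculateEntropyOfNGram(pos[nGram], negCount) > theta)
            {
                pos.Remove(nGram);
                neg.Remove(nGram);
            }
        }
    }
}
```

Under these assumptions, an n-gram seen 50 times in each dictionary has entropy 1.0 (perfectly even, so it carries no sentiment signal), while one seen 30 times in positive tweets and 10 times in negative tweets has entropy of about 0.81. Calling `NGramEntropy.RemoveCommonNGrams(PosDictionary, NegDictionary, 0.80)` would therefore drop both of them, so the threshold is worth tuning on your own data.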
