I am going to implement Naive Bayes classifier with Python and classify e-mails as

Question

Asked: May 25, 2026 (2026-05-25T20:08:46+00:00)

I am going to implement a Naive Bayes classifier in Python and classify e-mails as spam or not spam. I have a very long, sparse dataset with many entries. Each entry looks like the following:

1 9:3 94:1 109:1 163:1 405:1 406:1 415:2 416:1 435:3 436:3 437:4 …

Here 1 is the label (spam or not spam), and each pair corresponds to a word and its frequency. E.g. 9:3 means that word 9 occurs 3 times in this e-mail sample.

I need to read this dataset and store it in a structure. Since it’s a very big and sparse dataset, I’m looking for a neat data structure to store the following variables:

the index of each e-mail
its label (1 or -1)
each word and its frequency per e-mail
I also need to create a corpus of all words and their frequencies, with the label information.

Any suggestions for such a data structure?

1 Answer

Editorial Team · Answer 1 · 2026-05-25T20:08:46+00:00

I would define a class:

class Document(object):

    def __init__(self, index, label, bowdict):
        self.index = index      # position of the e-mail in the dataset
        self.label = label      # 1 or -1
        self.bowdict = bowdict  # sparse bag of words: {word: frequency}

You store your sparse vector in bowdict, e.g.

{ 9:3, 94:1, 109:1,  ... }

and hold all your data in a list of Documents.
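As a minimal sketch of how such a list could be built from the line format shown in the question: the helper names parse_line and load_documents are hypothetical, and the code assumes that every line starts with an integer label followed by whitespace-separated word:count pairs with integer word ids.

```python
class Document(object):

    def __init__(self, index, label, bowdict):
        self.index = index
        self.label = label
        self.bowdict = bowdict


def parse_line(index, line):
    """Parse one 'label word:count word:count' line (hypothetical helper)."""
    fields = line.split()
    label = int(fields[0])          # assumed to be 1 or -1
    bowdict = {}
    for pair in fields[1:]:
        word, count = pair.split(":")
        bowdict[int(word)] = int(count)
    return Document(index, label, bowdict)


def load_documents(lines):
    """Turn an iterable of text lines into a list of Documents, skipping blanks."""
    docs = []
    for line in lines:
        if line.strip():
            docs.append(parse_line(len(docs), line))
    return docs


# Usage with two made-up sample lines; a file object can be passed the same way.
docs = load_documents(["1 9:3 94:1 109:1", "-1 9:1 405:2"])
```

Because load_documents only iterates over its argument, an open file handle can be passed directly, so the whole file never has to be held in memory as text.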

To aggregate the word counts of all docs with a given label:

from collections import defaultdict

def aggregate(docs, label):
    # Sum the word counts over all documents that carry the given label.
    bow = defaultdict(int)
    for doc in docs:
        if doc.label == label:
            for word, counter in doc.bowdict.items():
                bow[word] += counter
    return bow

You can persist all your data with the pickle module (called cPickle in Python 2).
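A minimal sketch of that persistence step, assuming Python 3. The function names save_corpus and load_corpus and the file name corpus.pkl are hypothetical. Storing plain (index, label, bowdict) tuples rather than class instances is a choice made here so the file can be read back without the class definition being importable.

```python
import os
import pickle
import tempfile


def save_corpus(records, path):
    """Write a list of (index, label, bowdict) tuples to disk."""
    with open(path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_corpus(path):
    """Read the list of tuples back. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Usage with made-up records, written to a temporary directory.
records = [(0, 1, {9: 3, 94: 1}), (1, -1, {9: 1, 405: 2})]
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "corpus.pkl")   # hypothetical file name
    save_corpus(records, path)
    restored = load_corpus(path)
```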

Another approach would be to use scipy.sparse (http://docs.scipy.org/doc/scipy/reference/sparse.html). You can represent a bag-of-words vector as a sparse matrix with one row. To aggregate bag-of-words vectors, you just add them up. This could be considerably faster than the simple solution above.
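A minimal sketch of the one-row idea, assuming that word ids are integers below a known vocabulary size. VOCAB_SIZE and bow_to_row are hypothetical names, and 500 is only a placeholder value.

```python
from scipy.sparse import csr_matrix

VOCAB_SIZE = 500  # assumption: every word id is smaller than this


def bow_to_row(bowdict, vocab_size=VOCAB_SIZE):
    """Convert a {word: count} dict into a 1 x vocab_size sparse row."""
    cols = list(bowdict.keys())
    counts = list(bowdict.values())
    rows = [0] * len(cols)          # every entry lives in row 0
    return csr_matrix((counts, (rows, cols)), shape=(1, vocab_size), dtype=int)


# Usage: aggregating two bag-of-words rows is plain sparse addition.
a = bow_to_row({9: 3, 94: 1})
b = bow_to_row({9: 1, 109: 1})
total = a + b
```

Here total[0, 9] holds the combined count for word 9, and only the non-zero entries are stored.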

Furthermore, you could store all your sparse docs in one large matrix, where each Document instance holds a reference to the matrix and the index of its row.
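One way this layout might look, as a sketch under the same assumption of integer word ids below a known vocabulary size. The build_corpus and aggregate functions, the row attribute, and the input format (a list of (label, bowdict) pairs) are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


class Document(object):

    def __init__(self, index, label, matrix, row):
        self.index = index
        self.label = label
        self.matrix = matrix    # shared reference, not a copy
        self.row = row          # row of this document in the matrix

    def bow(self):
        """Return this document's bag of words as a 1 x vocab sparse row."""
        return self.matrix.getrow(self.row)


def build_corpus(parsed, vocab_size):
    """Build one CSR matrix from a list of (label, bowdict) pairs."""
    rows, cols, counts = [], [], []
    for r, (_, bowdict) in enumerate(parsed):
        for word, count in bowdict.items():
            rows.append(r)
            cols.append(word)
            counts.append(count)
    matrix = csr_matrix((counts, (rows, cols)),
                        shape=(len(parsed), vocab_size), dtype=int)
    docs = [Document(i, label, matrix, i) for i, (label, _) in enumerate(parsed)]
    return matrix, docs


def aggregate(matrix, docs, label):
    """Sum the rows of all documents with the given label into a dense vector."""
    rows = [doc.row for doc in docs if doc.label == label]
    return np.asarray(matrix[rows].sum(axis=0)).ravel()


# Usage with made-up data and a placeholder vocabulary size of 500.
parsed = [(1, {9: 3, 94: 1}), (-1, {9: 1}), (1, {9: 2, 405: 1})]
matrix, docs = build_corpus(parsed, 500)
spam_counts = aggregate(matrix, docs, 1)
```

The per-label totals come from a single row selection and column sum, which avoids looping over documents in Python.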
