I am going to implement Naive Bayes classifier with Python and classify e-mails as Spam or Not spam. I have a very sparse and long dataset with many entries. Each entry is like the following:
1 9:3 94:1 109:1 163:1 405:1 406:1 415:2 416:1 435:3 436:3 437:4 …
Where 1 is label (spam, not spam), and each pair corresponds to a word and its frequency. E.g. 9:3 corresponds to the word 9 and it occurs 3 times in this e-mail sample.
I need to read this dataset and store it in a structure. Since it’s a very big and sparse dataset, I’m looking for a neat data structure to store the following variables:
- the index of each e-mail
- label of it (1 or -1)
- word and it’s frequency per each e-mail
- I also need to create a corpus of all words and their frequency with the label information
Any suggestions for such a data structure?
I would generate a class
You store your sparse vector in
bowdict, egand hold all your data in a list of
DocumentsTo get an aggregation about all docs with a given label :
You can persist all your data with the
cPicklemodule.Another approach would be to use http://docs.scipy.org/doc/scipy/reference/sparse.html. You can represent a bow-vector as a sparse matrix with one row. If you want to aggregate bows you just have to add them up. This could be pretty faster than simple solution above.
Further you could store all your sparse docs in one large matrix, where a Document instance holds a reference to the matrix, and a row index for the associated row.