Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6612839
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 25, 20262026-05-25T20:08:46+00:00 2026-05-25T20:08:46+00:00

I am going to implement Naive Bayes classifier with Python and classify e-mails as

  • 0

I am going to implement Naive Bayes classifier with Python and classify e-mails as Spam or Not spam. I have a very sparse and long dataset with many entries. Each entry is like the following:

1 9:3 94:1 109:1 163:1 405:1 406:1 415:2 416:1 435:3 436:3 437:4 …

Where 1 is label (spam, not spam), and each pair corresponds to a word and its frequency. E.g. 9:3 corresponds to the word 9 and it occurs 3 times in this e-mail sample.

I need to read this dataset and store it in a structure. Since it’s a very big and sparse dataset, I’m looking for a neat data structure to store the following variables:

  • the index of each e-mail
  • label of it (1 or -1)
  • word and it’s frequency per each e-mail
  • I also need to create a corpus of all words and their frequency with the label information

Any suggestions for such a data structure?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-25T20:08:46+00:00Added an answer on May 25, 2026 at 8:08 pm

    I would generate a class

    class Document(object):
    
        def __init__(self, index, label, bowdict):
            self.index = index
            self.label = label
            self.bowdict = bowdict
    

    You store your sparse vector in bowdict, eg

    { 9:3, 94:1, 109:1,  ... } 
    

    and hold all your data in a list of Documents

    To get an aggregation about all docs with a given label :

    from collections import defaultdict
    
    def aggregate(docs, label):
        bow = defaultdict(int)
        for doc in docs:
            if doc.label == label:
               for (word, counter) in doc.bowdict.items():
                    bow[word] += counter  
        return bow    
    

    You can persist all your data with the cPickle module.

    Another approach would be to use http://docs.scipy.org/doc/scipy/reference/sparse.html. You can represent a bow-vector as a sparse matrix with one row. If you want to aggregate bows you just have to add them up. This could be pretty faster than simple solution above.

    Further you could store all your sparse docs in one large matrix, where a Document instance holds a reference to the matrix, and a row index for the associated row.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm going to implement license key generator on my website but I've very few
I'm going to implement my strings with internationalization in my Android app. I have
I'm going to implement a tokenizer in Python and I was wondering if you
i am going to implement the collapsiblepane in my application but it is not
I am going to implement the Calculator type application. In that I have set
What's the best way going forward to implement Wake on LAN using C#? The
I have a web service implemented in WCF. This service is only going to
Going round in circles here i think. I have an activity called Locate; public
I am trying to implement a reoccurring timer function asp.net. I am not able
I'm going to implement JSON-RPC web service. I need specifications for this. So far

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.