The Archive Base


Editorial Team
Asked: May 24, 2026 (2026-05-24T21:26:12+00:00)


I’m taking some AI classes and have learned about some basic algorithms that I want to experiment with. I have gotten access to several data sets containing lots of great real-world data through Kaggle, which hosts data analysis competitions.

I have tried entering several competitions to improve my machine learning skills, but have been unable to find a good way to access the data in my code. Kaggle provides one large data file, 50-200 MB, per competition in CSV format.

What is the best way to load and use these tables in my code? My first instinct was to use a database, so I tried loading the CSV into a single SQLite database, but this put a tremendous load on my computer, and it commonly crashed during the commits. Next, I tried using a MySQL server on a shared host, but queries took forever and made my analysis code really slow. Plus, I am afraid I will exceed my bandwidth.

In my classes so far, my instructors usually clean up the data and give us manageable datasets that can be completely loaded into RAM. Obviously this is not possible for my current interests. Please suggest how I should proceed. I am currently using a 4-year-old MacBook with 4 GB of RAM and a dual-core 2.1 GHz CPU.

By the way, I am hoping to do the bulk of my analysis in Python, as I know this language the best. I’d like a solution that allows me to do all or nearly all coding in this language.


1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 9:26 pm (2026-05-24T21:26:14+00:00)

    Prototype: that’s the most important thing when working with big data. Carve the data up sensibly so that you can load it into memory and access it with an interpreter (e.g., Python, R). That’s the best way to create and refine your analytics process flow at scale.

    In other words, trim your multi-GB-sized data files so that they are small enough to perform command-line analytics.

    Here’s the workflow I use to do that. It is surely not the best way to do it, but it is one way, and it works:

    I. Use lazy loading methods (hopefully) available in your language of choice to read in large data files, particularly those exceeding about 1 GB. I would then recommend processing this data stream according to the techniques I discuss below, then finally storing this fully pre-processed data in a Data Mart, or intermediate staging container.

    One example using Python to lazy load a large data file:

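    # 'some_filename' is the full path of a data file whose size
    # exceeds the memory of the machine it resides on.
    # csv.reader is used here instead of tokenize, because tokenize
    # parses Python source code rather than data files.

    import csv

    with open(some_filename, 'r', newline='') as data_file:
        reader = csv.reader(data_file)   # lazy: yields one row at a time
        header = next(reader)            # the first line (column names)
        first_row = next(reader)         # a single record from the large data file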
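
To illustrate what "processing this data stream" in step I might look like, here is a minimal sketch that makes a single pass over a CSV file and computes a running mean without ever holding the whole file in memory. The helper names (`iter_rows`, `running_mean`), the column name `page_views`, and the tiny in-memory sample are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import io

def iter_rows(handle):
    """Yield one parsed CSV record (as a dict) at a time; the file is never read whole."""
    for row in csv.DictReader(handle):
        yield row

def running_mean(rows, column):
    """Single-pass mean of a numeric column over a stream of rows."""
    count, mean = 0, 0.0
    for row in rows:
        count += 1
        mean += (float(row[column]) - mean) / count   # incremental update
    return count, mean

# Hypothetical stand-in for a large file; in practice pass open(path, newline='').
sample = io.StringIO("user_id,page_views\n1,10\n2,20\n3,30\n")
count, mean = running_mean(iter_rows(sample), "page_views")
print(count, mean)
```

Because both helpers consume an iterator, memory use stays roughly constant no matter how large the file is.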
    

    II. Whiten and Recast:

    • Recast your columns storing categorical variables (e.g., Male/Female) as integers (e.g., -1, 1). Maintain a look-up table (the same hash you used for this conversion, except that the keys and values are swapped) to convert these integers back to human-readable string labels as the last step in your analytic workflow;

    • Whiten your data, i.e., “normalize” the columns that hold continuous data. Both of these steps will substantially reduce the size of your data set without introducing any noise. A concomitant benefit of whitening is prevention of analytics error caused by over-weighting.
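
A rough sketch of both steps in step II, using only the standard library. The function names `recast` and `standardize` are hypothetical, the integer codes start at 0 rather than the -1/1 pair mentioned above, and "whitening" is interpreted here as column-wise standardization (zero mean, unit standard deviation), which assumes the column is not constant.

```python
from statistics import mean, pstdev

def recast(labels):
    """Map each distinct label to a small integer; also return the reverse look-up table."""
    encode = {label: code for code, label in enumerate(sorted(set(labels)))}
    decode = {code: label for label, code in encode.items()}   # keys and values swapped
    return [encode[label] for label in labels], decode

def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

sex = ["M", "F", "F", "M"]
codes, decode = recast(sex)
restored = [decode[c] for c in codes]      # last step: back to readable labels
scaled = standardize([2.0, 4.0, 6.0, 8.0])
print(codes, restored, scaled)
```

The `decode` dictionary plays the role of the look-up table: keep it alongside the recast data so the labels can be restored at the end of the workflow.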

    III. Sampling: Trim your data length-wise.
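
One way to trim the data length-wise without loading it all first is reservoir sampling, which draws a uniform random sample of fixed size from a stream of unknown length. The answer does not name a specific sampling method, so this is only a sketch of one reasonable choice; `reservoir_sample` is a hypothetical helper, and `range(1000)` stands in for the rows of a large file.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)             # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive on both ends
            if j < k:
                sample[j] = row            # replace with probability k / (i + 1)
    return sample

picked = reservoir_sample(range(1000), k=10, seed=0)
print(len(picked))
```

If the stream holds fewer than `k` rows, the function simply returns all of them.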

    IV. Dimension Reduction: the orthogonal analogue to sampling. Identify the variables (columns/fields/features) that have no influence or de minimis influence on the dependent variable (a.k.a., the ‘outcomes’ or response variable) and eliminate them from your working data cube.

    Principal Component Analysis (PCA) is a simple and reliable technique to do this:

    import numpy as NP
    from scipy import linalg as LA
    
    D = NP.random.randn(8, 5)       # a simulated data set: 8 rows, 5 columns
    # calculate the correlation matrix of the columns
    # (rowvar=False: each column is a variable):
    R = NP.corrcoef(D, rowvar=False)
    # calculate the eigenvalues and eigenvectors of the symmetric matrix R:
    eigval, eigvec = LA.eigh(R)
    # sort the eigenvalues in descending order:
    eigval = NP.sort(eigval)[::-1]
    # make a value-proportion table
    cs = NP.cumsum(eigval)/NP.sum(eigval)
    print("{0}\t{1}".format('eigenvalue', 'var proportion'))
    for i in range(len(eigval)):
        print("{0:.2f}\t\t{1:.2f}".format(eigval[i], cs[i]))
    
      eigenvalue    var proportion
        2.22        0.44
        1.81        0.81
        0.67        0.94
        0.23        0.99
        0.06        1.00
    

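    So as you can see, the first three eigenvalues account for 94% of the variance observed in the original data. Depending on your purpose, you can often trim the data matrix, D, from five columns to three. Because each eigenvalue belongs to a principal component rather than to an original column, do this by projecting D onto the three leading eigenvectors, not by deleting two of the original columns:

    top3 = LA.eigh(NP.corrcoef(D, rowvar=False))[1][:, ::-1][:, :3]   # eigenvectors of the 3 largest eigenvalues
    D = ((D - D.mean(axis=0)) / D.std(axis=0)) @ top3                 # standardized data projected onto them (8 x 3)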
    

    V. Data Mart Storage: insert a layer between your permanent storage (Data Warehouse) and your analytics process flow. In other words, rely heavily on data marts/data cubes, a ‘staging area’ that sits between your Data Warehouse and your analytics app layer. This data mart is a much better IO layer for your analytics apps. R’s ‘data frame’ or ‘data table’ (from the CRAN package of the same name) are good candidates.

    I also strongly recommend Redis: blazing fast reads, terse semantics, and zero configuration make it an excellent choice for this use case. Redis will easily handle datasets of the size you mentioned in your question. Using the Redis hash data structure, for instance, you can store each record as a set of named fields, much like a row in MySQL or SQLite, without the tedious configuration. Another advantage: unlike SQLite, Redis is in fact a database server. I am actually a big fan of SQLite, but I believe Redis just works better here for the reasons I just gave.

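    from redis import Redis
    r0 = Redis(db=0)    # assumes a Redis server running locally on the default port
    # store one user record as a Redis hash, keyed by the user id
    # (hset with mapping= replaces the deprecated hmset)
    r0.hset("user_id:100143321", mapping={
        'sex': 'M', 'status': 'registered_user',
        'traffic_source': 'affiliate', 'page_views_per_session': 17,
        'total_purchases': 28.15})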
    