The Archive Base

Editorial Team
Asked: May 25, 2026 (2026-05-25T14:35:01+00:00)

Background

I am working on a fairly computationally intensive computational linguistics project, but the problem I have is quite general, so I expect that a solution would be interesting to others as well.

Requirements

The key aspect of this particular program I must write is that it must:

  1. Read through a large corpus (between 5 GB and 30 GB, and potentially larger down the line).
  2. Process the data on each line.
  3. From this processed data, construct a large number of vectors (some of these vectors have dimensionality > 4,000,000). Typically it builds hundreds of thousands of such vectors.
  4. These vectors must all be saved to disk in some format or other.

Steps 1 and 2 are not hard to do efficiently: just use generators and a data-analysis pipeline. The big problem is step 3 (and, by extension, step 4).
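For reference, the generator pipeline for steps 1 and 2 might look roughly like the sketch below. The function names and the tokenizing step are hypothetical placeholders, since the question does not say what the per-line processing actually is.

```python
from typing import Iterable, Iterator, List


def read_lines(path: str) -> Iterator[str]:
    """Yield lines lazily so the whole corpus never sits in memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def tokenize(lines: Iterable[str]) -> Iterator[List[str]]:
    """Hypothetical processing step: lowercase and split on whitespace."""
    for line in lines:
        tokens = line.lower().split()
        if tokens:  # skip blank lines
            yield tokens


def pipeline(path: str) -> Iterator[List[str]]:
    """Chain the stages; nothing is read until the result is iterated."""
    return tokenize(read_lines(path))
```

Because every stage is a generator, memory use stays at roughly one line at a time regardless of corpus size.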

Parenthesis: Technical Details

In case the actual procedure for building vectors affects the solution:

For each line in the corpus, one or more vectors must have their basis weights updated.

If you think of them in terms of Python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing their values at one or more indices; the increment may differ from index to index.

Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
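The update rule described above can be sketched as follows. This assumes the vectors are sparse (most of the > 4,000,000 entries stay at zero), which the question does not state, so each vector is stored as a dict from index to weight rather than as a full list. The `Update` tuple layout and all names are illustrative.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Tuple

# One update = (vector key, basis index, increment). Hypothetical layout.
Update = Tuple[str, int, float]
SparseVector = Dict[int, float]


def apply_updates(
    vectors: DefaultDict[str, SparseVector], updates: Iterable[Update]
) -> None:
    """Increment the named vectors in place, creating them if needed."""
    for key, index, weight in updates:
        vec = vectors[key]  # defaultdict creates the vector on first access
        vec[index] = vec.get(index, 0.0) + weight


vectors: DefaultDict[str, SparseVector] = defaultdict(dict)
# Two "lines" of the corpus, each producing one or more updates.
apply_updates(vectors, [("cat", 3_999_999, 1.0), ("cat", 7, 0.5)])
apply_updates(vectors, [("dog", 7, 2.0), ("cat", 7, 0.5)])
```

Since the updates are plain additions, applying them in any order gives the same final vectors, which matches the order-independence noted above.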

Attempted Solutions

There are three extreme approaches to this:

  1. I could build all the vectors in memory, then write them to disk.
  2. I could build all the vectors directly on disk, using shelve, pickle, or some such library.
  3. I could build the vectors in memory one at a time, writing each one to disk and passing through the corpus once per vector.

All three options are fairly intractable. Option 1 uses up all the system memory, and the system panics and slows to a crawl. Option 2 is way too slow because disk I/O is slow. Option 3 is possibly even slower than option 2 for the same reason.

Goals

A good solution would involve:

  1. Building as much as possible in memory.
  2. Once memory is full, dump everything to disk.
  3. If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
  4. Go back to 1 until all vectors are built.
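One way to approximate this loop is sketched below, under stated assumptions. It is a variation on the goals above rather than a literal implementation: instead of reloading vectors from disk (goal 3), it appends partial vectors to shard files and sums them afterwards, which relies on the updates being order-independent additions. The sparse dict representation, the entry-count budget `max_entries`, the shard count, and all function names are assumptions made for illustration.

```python
import os
import pickle
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Update = Tuple[str, int, float]  # (vector key, basis index, increment)
Vectors = Dict[str, Dict[int, float]]


def shard_of(key: str, n_shards: int) -> int:
    # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % n_shards


def flush(buffer: Vectors, out_dir: str, n_shards: int) -> None:
    """Append the buffered partial vectors to their shard files, then clear."""
    by_shard: Dict[int, Vectors] = defaultdict(dict)
    for key, vec in buffer.items():
        by_shard[shard_of(key, n_shards)][key] = vec
    for shard, part in by_shard.items():
        # Append mode: out_dir must not contain shards from an earlier run.
        with open(os.path.join(out_dir, f"shard-{shard}.pkl"), "ab") as fh:
            pickle.dump(part, fh, protocol=pickle.HIGHEST_PROTOCOL)
    buffer.clear()


def build(updates: Iterable[Update], out_dir: str,
          n_shards: int = 16, max_entries: int = 1_000_000) -> None:
    """Accumulate in memory; spill whenever the entry budget is reached."""
    os.makedirs(out_dir, exist_ok=True)
    buffer: Vectors = {}
    entries = 0
    for key, index, weight in updates:
        vec = buffer.setdefault(key, {})
        if index not in vec:
            entries += 1
        vec[index] = vec.get(index, 0.0) + weight
        if entries >= max_entries:
            flush(buffer, out_dir, n_shards)
            entries = 0
    flush(buffer, out_dir, n_shards)


def merge_shard(path: str) -> Vectors:
    """Sum every partial vector stored in one shard file."""
    merged: Vectors = {}
    with open(path, "rb") as fh:
        while True:
            try:
                part = pickle.load(fh)
            except EOFError:
                break
            for key, vec in part.items():
                target = merged.setdefault(key, {})
                for index, weight in vec.items():
                    target[index] = target.get(index, 0.0) + weight
    return merged
```

Each key always lands in the same shard, so shards can be merged one at a time (or in parallel) and only one shard's vectors need to fit in memory during the final pass.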

The problem is that I’m not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don’t see how this sort of problem can be optimally solved without taking this into account. As a result, I don’t really know how to get started on this sort of thing.

Question

Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?

Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.

Additional Details

The sort of machine this problem is run on usually has 20+ cores and ~70 GB of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
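The MapReduce-style split described here could be sketched with the standard library's process pool. The per-line processing in `map_segment` is a made-up stand-in (first token names the vector, each remaining token increments the index equal to its length), and the function names are hypothetical; only the shape of the map and reduce steps is the point.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import reduce
from typing import Dict, List, Sequence

Vectors = Dict[str, Counter]


def map_segment(lines: Sequence[str]) -> Vectors:
    """Map step: build partial vectors from one corpus segment.

    Hypothetical processing: the first token names the vector, and each
    remaining token increments basis index len(token) by 1.
    """
    partial: Vectors = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        vec = partial.setdefault(tokens[0], Counter())
        for token in tokens[1:]:
            vec[len(token)] += 1
    return partial


def reduce_pair(left: Vectors, right: Vectors) -> Vectors:
    """Reduce step: add the right-hand partial vectors into the left."""
    for key, vec in right.items():
        left.setdefault(key, Counter()).update(vec)  # update() adds counts
    return left


def build_parallel(segments: List[Sequence[str]], workers: int = 4) -> Vectors:
    """Run the map step in worker processes and sum the partial results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(map_segment, segments)
        return reduce(reduce_pair, partials, {})
```

When calling `build_parallel` from a script, it is safest to do so under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. Note that each partial result is pickled back to the parent process, so very large partial vectors may be better written to disk by the workers instead.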

Part of the question involves determining a limit on how much can be built in memory before disk writes need to occur. Does Python offer any mechanism to determine how much RAM is available?
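As a rough illustration of the RAM question, the sketch below tries two approaches. The first uses the third-party `psutil` package if it happens to be installed; the second falls back to `os.sysconf` with names that exist on Linux but not on every platform. The function name `available_ram_bytes` is hypothetical, and the result is only a snapshot that can change while the program runs.

```python
import os
from typing import Optional


def available_ram_bytes() -> Optional[int]:
    """Best-effort estimate of free physical memory, or None if unknown."""
    try:
        import psutil  # third-party and optional

        return int(psutil.virtual_memory().available)
    except ImportError:
        pass
    try:
        # Linux-specific sysconf names; other platforms may not define them.
        pages = os.sysconf("SC_AVPHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
        return pages * page_size
    except (AttributeError, ValueError, OSError):
        return None
```

A caller could, for example, take some fraction of this figure as the in-memory budget and flush to disk once the estimated size of the buffered vectors exceeds it.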

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 2:35 pm

    Take a look at PyTables. One of its advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.

    Edit: Because I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O operations per second and virtually no seek time. The size of your project is a good fit for today's affordable SSDs.
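To make the "stored on disk, as if it were in memory" idea concrete, here is a minimal sketch using only the standard library's `mmap`. It deliberately does not use the PyTables API (PyTables is a third-party package built on HDF5 that adds chunking and compression); it only illustrates the general idea with a memory-mapped file. The class name `DiskVectors` and the dense float64 layout are assumptions, and a dense layout like this would be far too large at the scale in the question.

```python
import mmap


class DiskVectors:
    """Hypothetical dense n_vectors x dim float64 array backed by a file."""

    ITEM_SIZE = 8  # bytes per float64

    def __init__(self, path: str, n_vectors: int, dim: int) -> None:
        # Create a zero-filled file of the right size (all-zero bytes == 0.0).
        with open(path, "wb") as fh:
            fh.truncate(n_vectors * dim * self.ITEM_SIZE)
        self._fh = open(path, "r+b")
        self._mm = mmap.mmap(self._fh.fileno(), 0)  # map the whole file
        self._raw = memoryview(self._mm)
        self._cells = self._raw.cast("d")  # view the bytes as doubles
        self._dim = dim

    def add(self, row: int, col: int, weight: float) -> None:
        """Increment one cell; the OS pages data in and out as needed."""
        self._cells[row * self._dim + col] += weight

    def get(self, row: int, col: int) -> float:
        return self._cells[row * self._dim + col]

    def close(self) -> None:
        # Views must be released before the map itself can be closed.
        self._cells.release()
        self._raw.release()
        self._mm.close()
        self._fh.close()
```

With this approach the operating system decides which parts of the file are held in RAM, so access patterns that jump around the file will still be bound by disk speed, which is where the SSD suggestion comes in.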
