Background I am working on a fairly computationally intensive project for a computational linguistics


Asked: May 25, 2026


Background

I am working on a fairly computationally intensive project in computational linguistics, but the problem is quite general, so I expect a solution would be interesting to others as well.

Requirements

The key aspect of this particular program I must write is that it must:

1. Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
2. Process the data on each line.
3. From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
4. These vectors must all be saved to disk in some format or other.

Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is step 3 (and, by extension, step 4).
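For readers unfamiliar with the generator approach mentioned for steps 1 and 2, here is a minimal sketch. The function names and the whitespace tokenizer are placeholders of mine, not the actual processing used in the project; the point is only that each stage pulls one line at a time, so memory use stays flat regardless of corpus size.

```python
import io

def read_lines(handle):
    """Yield stripped, non-empty lines one at a time (never loads the whole file)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line

def tokenize(lines):
    """Yield a list of lowercase tokens per line (placeholder processing step)."""
    for line in lines:
        yield line.lower().split()

# A small in-memory stand-in for the corpus; a real run would pass open(path).
corpus = io.StringIO("The cat sat\n\nthe dog sat\n")
processed = list(tokenize(read_lines(corpus)))
```

Calling list() here is only for demonstration; in a real pipeline the result would be consumed lazily by the vector-building stage.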

Parenthesis: Technical Details

In case the actual procedure for building vectors affects the solution:

For each line in the corpus, one or more vectors must have their basis weights updated.

If you think of them in terms of Python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).

Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
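To make the update rule concrete, here is a rough sketch. It assumes the vectors are sparse (most of the 4,000,000+ positions stay at zero), which the description above does not actually state; if they are dense, a dict-of-dicts like this would be the wrong structure. The name apply_updates and the (entity, index, increment) triple format are illustrative only.

```python
from collections import defaultdict

# entity -> {index: weight}; only indices that were actually touched are stored.
vectors = defaultdict(lambda: defaultdict(float))

def apply_updates(vectors, updates):
    """Apply an iterable of (entity, index, increment) triples in place."""
    for entity, index, increment in updates:
        # Missing vectors and missing indices are created on first access.
        vectors[entity][index] += increment

# Updates produced by two hypothetical corpus lines.
apply_updates(vectors, [("cat", 3_999_999, 1.0), ("cat", 7, 0.5), ("dog", 7, 2.0)])
apply_updates(vectors, [("cat", 7, 0.5)])
```

Because the updates are plain additions, the order in which lines are processed does not affect the final weights, which matches the independence property described above.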

Attempted Solutions

There are three extrema when it comes to how to do this:

1. I could build all the vectors in memory, then write them to disk.
2. I could build all the vectors directly on the disk, using shelve or pickle or some such library.
3. I could build the vectors in memory one at a time, writing each to disk and passing through the corpus once per vector.

All these options are fairly intractable. Option 1 just uses up all the system memory, and the machine panics and slows to a crawl. Option 2 is way too slow, as IO operations aren't fast. Option 3 is possibly even slower than option 2 for the same reason.

Goals

A good solution would involve:

1. Building as much as possible in memory.
2. Once memory is full, dump everything to disk.
3. If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
4. Go back to 1 until all vectors are built.
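One possible shape for this build/dump/recover loop is sketched below. Everything here is an assumption for illustration: the class name SpillingAccumulator, the use of the standard-library shelve module as the on-disk store, and the use of a simple entry count as a crude stand-in for "memory is full". Entity keys must be strings because shelve requires string keys.

```python
import os
import shelve
import tempfile
from collections import Counter, defaultdict

class SpillingAccumulator:
    """Buffer sparse vector updates in memory and merge them to disk when full."""

    def __init__(self, path, max_entries):
        self.path = path                # base filename for the shelve store
        self.max_entries = max_entries  # crude proxy for a memory budget
        self.buffer = defaultdict(Counter)
        self.entries = 0

    def add(self, entity, index, increment):
        vec = self.buffer[entity]
        if index not in vec:
            self.entries += 1
        vec[index] += increment
        if self.entries >= self.max_entries:
            self.flush()

    def flush(self):
        """Read each stored vector back, add the buffered part, write it out."""
        with shelve.open(self.path) as store:
            for entity, vec in self.buffer.items():
                merged = store.get(entity, Counter())
                merged.update(vec)  # Counter.update adds values, it does not replace
                store[entity] = merged
        self.buffer.clear()
        self.entries = 0

store_path = os.path.join(tempfile.mkdtemp(), "vectors")
acc = SpillingAccumulator(store_path, max_entries=2)
for entity, index, inc in [("cat", 1, 1.0), ("dog", 5, 2.0), ("cat", 1, 0.5), ("cat", 9, 1.0)]:
    acc.add(entity, index, inc)
acc.flush()  # write out whatever is still buffered

with shelve.open(store_path) as store:
    result = {entity: dict(vec) for entity, vec in store.items()}
```

The disk is touched once per flush rather than once per update, so a larger max_entries trades memory for fewer, bigger writes. A real implementation would likely want a faster store and a budget measured in bytes rather than entries.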

The problem is that I’m not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don’t see how this sort of problem can be optimally solved without taking this into account. As a result, I don’t really know how to get started on this sort of thing.

Question

Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?

Many thanks for your attention. I look forward to seeing what the bright minds of Stack Overflow can throw my way.

Additional Details

The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
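The MapReduce-style decomposition described here could look roughly like the sketch below. The names build_partial and merge and the triple format for updates are hypothetical. The map step is run sequentially with the built-in map to keep the example small; since the segments are independent, a process pool's map could presumably be substituted to spread the work across cores.

```python
from collections import Counter
from functools import reduce

def build_partial(updates):
    """Map step: build {entity: Counter} from one corpus segment's updates."""
    partial = {}
    for entity, index, increment in updates:
        partial.setdefault(entity, Counter())[index] += increment
    return partial

def merge(left, right):
    """Reduce step: add the vectors in right into left, entity by entity."""
    for entity, vec in right.items():
        left.setdefault(entity, Counter()).update(vec)
    return left

# Two hypothetical corpus segments, already turned into update triples.
segments = [
    [("cat", 0, 1), ("dog", 2, 1)],
    [("cat", 0, 2), ("cat", 4, 1)],
]
partials = map(build_partial, segments)
total = reduce(merge, partials, {})
```

The merged result is the same as if a single worker had processed both segments, because vector addition is associative and commutative.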

Part of the question involves determining a limit on how much can be built in memory before disk writes need to occur. Does Python offer any mechanism to determine how much RAM is available?
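On the RAM question specifically, one standard-library option on Linux is os.sysconf, sketched below. The function name available_ram_bytes is my own, and this is an assumption-laden approach: the SC_AVPHYS_PAGES name is not available on every platform (the function returns None where it is missing), and the figure reflects free physical pages, which may understate what the OS could reclaim from caches. The third-party psutil package exposes a more portable figure via psutil.virtual_memory().available.

```python
import os

def available_ram_bytes():
    """Return an estimate of available physical memory in bytes, or None."""
    try:
        pages = os.sysconf("SC_AVPHYS_PAGES")    # free physical pages (Linux/glibc)
        page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per page
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the name is unsupported here.
        return None
    if pages < 0 or page_size < 0:
        return None
    return pages * page_size

ram = available_ram_bytes()
```

A flush threshold could then be set to some fraction of this value, re-checked periodically, since other processes on a shared machine will change it over time.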


1 Answer

Editorial Team · Answer 1 · 2026-05-25T14:35:02+00:00


Take a look at PyTables. One of its advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.

Edit: Because I/O performance will be a bottleneck (if not the bottleneck), you will want to consider SSDs: high I/O operations per second and virtually no seek time. The size of your project is a good fit for today's affordable SSDs.
