The Archive Base

Editorial Team
Asked: May 13, 2026 (2026-05-13T15:03:40+00:00)

I have a file on disk that’s only 168MB. It’s just a comma separated list of word,id.
The word can be 1-5 characters long. There’s 6.5 million lines.

I created a dictionary in Python to load this into memory so I can search incoming text against that list of words. When Python loads it into memory, it shows 1.3 GB of RAM in use. Any idea why that is?

So let’s say my word file looks like this…

1,word1
2,word2
3,word3

Then add 6.5 million to that.
I then loop through that file and create a dictionary (python 2.6.1):

import csv
import os

# module-level dict that maps term -> term_id
cached_terms = {}

def load_term_cache():
    """will load the term cache from our cached file instead of hitting mysql. If it didn't
    preload into memory it would be 20+ million queries per process"""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    # each row of the file is "id,term"
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()

Just doing that blows up the memory. In Activity Monitor I can see it peg the memory to all that is available, up to around 1.5 GB of RAM. On my laptop it just starts to swap. Any ideas how to most efficiently store key/value pairs in memory with python?

Update: I tried to use the anydbm module, and after 4.4 million records it just dies.
In the output below, the integer is the number of rows processed and the floating point number is the elapsed seconds since I started loading.

56.95
3400018
60.12
3600019
63.27
3800020
66.43
4000021
69.59
4200022
72.75
4400023
83.42
4600024
168.61
4800025
338.57

You can see it was running great. 200,000 rows every few seconds inserted until I hit a wall and time doubled.

import anydbm
import os
import time

i=0
mark=0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
#load from existing baseterm file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    pieces = line.split(',')
    db[str(pieces[1])] = str(pieces[0])
    #report progress roughly every 200,000 rows
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()
1 Answer

    Editorial Team
    Added an answer on May 13, 2026 at 3:03 pm

    Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what is the “it” that shows memory used, what it shows when you load a file with zero entries, and what platform you are on, and what version of Python.

    You say that “the word can be 1-5 words long”. What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max integers? If not, what is the average length of the ID in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?

    Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' … isn’t that bassackwards?

    Update 3: More questions: How is the “word” encoded? Are you sure you are not carrying a load of trailing spaces on either of the two fields?

    Update 4 … You asked “how to most efficiently store key/value pairs in memory with python” and nobody’s answered that yet with any accuracy.

    You have a 168 MB file with 6.5 million lines. That’s 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it’s a *nix platform) and we’re left with 25 bytes per line. Assuming the “id” is intended to be unique, and as it appears to be an integer, let’s assume the “id” is 7 bytes long; that leaves us with an average size of 18 bytes for the “word”. Does that match your expectation?
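The back-of-envelope arithmetic above can be replayed in a few lines. This is only a sketch in Python 3; the 7-byte id length is the same assumption the text makes, and the variable names are illustrative.

```python
# Replay the per-line arithmetic from the paragraph above.
FILE_SIZE_MB = 168        # file size reported in the question
LINES_MILLIONS = 6.5      # line count reported in the question

# A MB is taken as 1024 ** 2 bytes, so MB per million lines scales by 1.024 ** 2.
bytes_per_line = FILE_SIZE_MB * 1.024 ** 2 / LINES_MILLIONS

payload = bytes_per_line - 2     # drop the comma and the newline
ASSUMED_ID_LEN = 7               # assumption carried over from the text
avg_word_len = payload - ASSUMED_ID_LEN

print(round(bytes_per_line, 1), round(payload), round(avg_word_len))
# prints: 27.1 25 18
```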

    So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.

    Let’s assume a 32-bit CPython 2.6 platform.

    >>> import sys
    >>> K = sys.getsizeof('123456789012345678')
    >>> V = sys.getsizeof('1234567')
    >>> K, V
    (42, 31)
    

    Note that sys.getsizeof(str_object) => 24 + len(str_object)
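The 24-byte figure applies to the 32-bit CPython 2.6 build assumed above. A rough way to find the equivalent overhead on whatever interpreter you are running is sketched below (Python 3, CPython only; `str_overhead` is a name made up for this example, and the numbers it prints will differ between builds).

```python
import sys

def str_overhead(sample="123456789012345678"):
    """Return (total size, fixed overhead) for an ASCII str on this interpreter."""
    total = sys.getsizeof(sample)
    # Whatever is left after subtracting the characters is per-object overhead.
    return total, total - len(sample)

total, overhead = str_overhead()
print(total, overhead)
```

If the overhead on your build is larger than 24 bytes, the per-entry costs worked out below grow accordingly.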

    Tuples were mentioned by one answerer. Note carefully the following:

    >>> sys.getsizeof(())
    28
    >>> sys.getsizeof((1,))
    32
    >>> sys.getsizeof((1,2))
    36
    >>> sys.getsizeof((1,2,3))
    40
    >>> sys.getsizeof(("foo", "bar"))
    36
    >>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
    36
    >>>
    

    Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) … it only allows for a pointer to each item, it doesn’t allow for the sizes of the items.
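To get a figure that does include the items, the shallow sizes have to be added up by hand. A minimal sketch in Python 3 (CPython); `tuple_total_size` is a hypothetical helper, and it counts an object twice if the same object appears in the tuple more than once.

```python
import sys

def tuple_total_size(t):
    """Shallow size of the tuple plus the shallow size of each item."""
    return sys.getsizeof(t) + sum(sys.getsizeof(item) for item in t)

short = ("foo", "bar")
long_ = ("fooooooooooooooooooooooo", "bar")

# The tuples themselves report the same size ...
print(sys.getsizeof(short) == sys.getsizeof(long_))
# ... but the totals differ by the extra characters in the first item.
print(tuple_total_size(long_) - tuple_total_size(short))
```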

    A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) … again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn’t have to call the system realloc() on every list.append() call. For sufficiently large size (like 6.5 million!) the overallocation is 12.5 percent — see the source (Objects/listobject.c). This overallocation is not done with tuples (their size doesn’t change).
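The overallocation can be observed without reading the C source: the reported size of a list only changes on some appends, not on every one. A small sketch (Python 3, CPython; `growth_steps` is a name invented for this example, and the exact step sizes vary between versions).

```python
import sys

def growth_steps(n):
    """Append n items one at a time and record each length at which
    the list's reported size changes."""
    items = []
    last = sys.getsizeof(items)
    steps = []
    for _ in range(n):
        items.append(None)
        size = sys.getsizeof(items)
        if size != last:
            steps.append((len(items), size))
            last = size
    return steps

steps = growth_steps(1000)
print(len(steps), "resizes for 1000 appends")
```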

    Here are the costs of various alternatives to dict for a memory-based look-up table:

    List of tuples:

    Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for the contents. So N of them will take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.

    Total for list of tuples: 36 + N * (40.5 + K + V)

    That’s 36 + 113.5 * N (about 709 MB when N is 6.5 million)

    Two parallel lists:

    (36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N)
    i.e. 72 + N * (9 + K + V)

    Note that the difference between 40.5 * N and 9 * N is about 200MB when N is 6.5 million.
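The two cost formulas can be written out as functions so that other values of N, K and V are easy to try. This is a sketch using the 32-bit CPython 2.6 sizes assumed in the text; the constant and function names are illustrative. Results are converted with 1024 ** 2 bytes per MB, so they land within a few MB of the rounded figures quoted here rather than matching them exactly.

```python
# Sizes assumed in the text for 32-bit CPython 2.6.
K, V = 42, 31            # str objects for an 18-byte key and a 7-byte value
TUPLE2 = 36              # a 2-tuple: 28 + 4 * 2
LIST_BASE = 36           # an empty list
PTR = 4                  # one pointer per list slot
LIST_OVERALLOC = 1.125   # 12.5 percent overallocation for large lists

def list_of_tuples(n):
    """Bytes for one list holding n (key, value) tuples."""
    return LIST_BASE + n * (LIST_OVERALLOC * PTR + TUPLE2 + K + V)

def two_parallel_lists(n):
    """Bytes for one list of n keys plus one list of n values."""
    return 2 * LIST_BASE + n * (2 * LIST_OVERALLOC * PTR + K + V)

N = 6500000
MB = 1024.0 ** 2
saving_mb = (list_of_tuples(N) - two_parallel_lists(N)) / MB
print(round(list_of_tuples(N) / MB), round(two_parallel_lists(N) / MB), round(saving_mb))
```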

    Value stored as int not str:

    But that’s not all. If the IDs are actually integers, we can store them as such.

    >>> sys.getsizeof(1234567)
    12
    

    That’s 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.

    Use array.array('l') instead of list for the (integer) value:

    We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them: just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that’s 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N i.e. 76 MB.
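The difference between a list of int objects and an array is easy to measure directly. The sketch below is Python 3 on CPython; note that the 'l' typecode is a C long, which is 4 bytes on the 32-bit build assumed here but 8 bytes on most 64-bit Unix builds, so the absolute numbers will not match the text.

```python
import sys
from array import array

n = 100000
as_list = list(range(n))
as_array = array('l', range(n))

# A list stores pointers, so the int objects must be counted separately.
list_total = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
# An array stores the raw values inline, so its shallow size is the whole cost.
array_total = sys.getsizeof(as_array)

print(list_total, array_total, as_array.itemsize)
```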

    So we’re down to 709 - 200 - 118 - 76 = about 315 MB.
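Parallel sequences need some way to find the slot for a given word. The text above does not say how; one common option, assumed here, is to keep the keys sorted and binary-search them with the standard `bisect` module. The sketch below is Python 3, and `build_table` and `lookup` are hypothetical names.

```python
from array import array
from bisect import bisect_left

def build_table(pairs):
    """Build sorted parallel sequences from (word, int_id) pairs."""
    ordered = sorted(pairs)
    keys = [word for word, _ in ordered]
    ids = array('l', (term_id for _, term_id in ordered))
    return keys, ids

def lookup(keys, ids, word, default=None):
    """Binary-search the sorted keys; return the matching id or default."""
    i = bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return ids[i]
    return default

keys, ids = build_table([("word3", 3), ("word1", 1), ("word2", 2)])
print(lookup(keys, ids, "word2"))   # prints 2
print(lookup(keys, ids, "nope"))    # prints None
```

The trade-off is that each look-up costs O(log N) comparisons instead of a dict's O(1) average, and inserting new words after the table is built is expensive.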

    N.B. Errors and omissions excepted — it’s 0127 in my TZ 🙁

© 2021 The Archive Base. All Rights Reserved