The Archive Base

Editorial Team
Asked: May 22, 2026 (2026-05-22T11:30:12+00:00)

I was wondering whether someone might know the answer to the following.

I’m using Python to build a character-based suffix tree. There are over 11 million nodes in the tree, which fits into approximately 3GB of memory. This was down from 7GB after switching the node class to __slots__ instead of a per-instance __dict__.
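For readers unfamiliar with the two approaches, here is a minimal sketch of the difference. The class and field names (DictNode, SlotNode, char, children) are hypothetical and not the actual node class from the tree; the sizes it prints are shallow and depend on the CPython version.

```python
import sys


class DictNode:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, char):
        self.char = char
        self.children = []


class SlotNode:
    """Same fields, but __slots__ removes the per-instance __dict__."""

    __slots__ = ("char", "children")

    def __init__(self, char):
        self.char = char
        self.children = []


def shallow_size(node):
    """Size of the instance plus its __dict__, if any (children not counted)."""
    size = sys.getsizeof(node)
    if hasattr(node, "__dict__"):
        size += sys.getsizeof(node.__dict__)
    return size


dict_size = shallow_size(DictNode("a"))
slot_size = shallow_size(SlotNode("a"))
print(dict_size, slot_size)
```

Multiplied by millions of nodes, the per-instance difference is the kind of saving described above.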

When I serialise the tree with pickle (using the highest protocol), the resulting file is more than a hundred times smaller.

When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from? Is it something to do with Python’s handling of memory references to class instances?

Update

Thank you larsmans and Gurgeh for your very helpful explanations and advice. I’m using the tree as part of an information retrieval interface over a corpus of texts.

I originally stored the children (max of 30) as a NumPy array, then tried the hardware version (ctypes.py_object*30), the Python array (ArrayType), as well as the dictionary and set types.

Lists seemed to do better (using guppy to profile the memory, and __slots__ = ['variable', ...]), but I’m still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy for nodes with only one child, and I have quite a lot of them. 😉

After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but maybe I can do this as the tree is constructed. As construction time is not too important in my case, array.array() sounds like something that would be useful to try. Thanks for the tip, really appreciated.
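One way array.array() could be applied, sketched here under assumptions: instead of one Python object per node, the whole tree lives in three parallel integer arrays using a first-child / next-sibling layout, so no per-node child capacity has to be fixed in advance. The FlatTrie class and its method names are hypothetical, and the example builds a naive suffix trie (every suffix inserted character by character), not a compressed suffix tree.

```python
from array import array


class FlatTrie:
    """Character trie stored in parallel arrays instead of node objects.

    Hypothetical layout: node i has code point chars[i], its first child
    at first_child[i] and its next sibling at next_sibling[i]; -1 means
    "none". Node 0 is the root.
    """

    def __init__(self):
        self.chars = array("i", [0])
        self.first_child = array("i", [-1])
        self.next_sibling = array("i", [-1])

    def _child(self, node, code):
        """Return the child of `node` labelled `code`, or -1 if absent."""
        child = self.first_child[node]
        while child != -1 and self.chars[child] != code:
            child = self.next_sibling[child]
        return child

    def insert(self, text):
        node = 0
        for code in map(ord, text):
            child = self._child(node, code)
            if child == -1:
                child = len(self.chars)
                self.chars.append(code)
                self.first_child.append(-1)
                # The new node becomes the head of the parent's child list.
                self.next_sibling.append(self.first_child[node])
                self.first_child[node] = child
            node = child

    def contains_prefix(self, text):
        node = 0
        for code in map(ord, text):
            node = self._child(node, code)
            if node == -1:
                return False
        return True


word = "banana"
trie = FlatTrie()
for start in range(len(word)):
    trie.insert(word[start:])  # naive construction: insert every suffix

print(len(trie.chars), trie.contains_prefix("nan"))
```

The trade-off is that finding a child walks a sibling chain rather than indexing directly, which costs lookup time in exchange for a fixed, small number of bytes per node.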

I’ll let you know how it goes.

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 11:30 am

    If you try to pickle an empty list, you get:

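    >>> from StringIO import StringIO  # Python 2 module
    >>> import cPickle as pickle       # assumption: the C pickler, default protocol
    >>> s = StringIO()
    >>> pickle.dump([], s)
    >>> s.getvalue()
    '(l.'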
    

    and similarly '(d.' for an empty dict. That’s three bytes. The in-memory representation of a list, however, contains

    • a reference count
    • a pointer to the type object, which in turn holds the type name and bookkeeping info for memory allocation
    • a pointer to a vector of pointers to actual elements
    • and yet more bookkeeping info.

    On my machine, which has 64-bit pointers, the size of a Python list header object is 40 bytes, so that’s one order of magnitude. I assume an empty dict will have a similar size.
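A rough way to see the same gap on Python 3 is sketched below. It compares the pickled length of an empty list and dict with sys.getsizeof. The numbers will not match the 40 bytes quoted above, since getsizeof also counts the garbage-collector header and the layout varies between CPython versions; the results dictionary is only there for illustration.

```python
import pickle
import sys

results = {}
for empty in ([], {}):
    pickled = pickle.dumps(empty, protocol=pickle.HIGHEST_PROTOCOL)
    # Map type name -> (bytes on disk, shallow bytes in memory).
    results[type(empty).__name__] = (len(pickled), sys.getsizeof(empty))

for name, (pickled_len, memory_len) in results.items():
    print(name, pickled_len, memory_len)
```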

    Then, both list and dict use an overallocation strategy to obtain amortized O(1) performance for their main operations. On top of that, malloc introduces overhead, there’s alignment, and there are member attributes that you may not even be aware of. These and various other factors get you the second order of magnitude.
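The overallocation can be observed directly, at least on CPython. The sketch below appends to a list and records its reported size after each append; the size only jumps occasionally because spare capacity is reserved ahead of time. The exact growth pattern is an implementation detail and differs between versions.

```python
import sys

items = []
sizes = []
for _ in range(64):
    items.append(None)
    sizes.append(sys.getsizeof(items))

# Count how many of the appends actually changed the allocated size.
growth_steps = sum(1 for before, after in zip(sizes, sizes[1:]) if after != before)

# A list built in one go at its final length, for comparison.
exact_size = sys.getsizeof([None] * 64)

print(growth_steps, sizes[-1], exact_size)
```

For a tree with many small child lists built by repeated append, this spare capacity is paid once per node.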

    Summing up: pickle is a pretty good compression algorithm for Python objects 🙂


© 2021 The Archive Base. All Rights Reserved