I was wondering whether someone might know the answer to the following. I’m using

Asked: May 22, 2026

I was wondering whether someone might know the answer to the following.

I’m using Python to build a character-based suffix tree. The tree has over 11 million nodes and fits into approximately 3GB of memory. That is down from 7GB, which I achieved by declaring __slots__ on the node class instead of letting each instance carry its own __dict__.
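For readers unfamiliar with the two approaches, here is a minimal sketch of the difference. The class names DictNode and SlotNode and the attributes char and children are hypothetical, not taken from the actual tree code, and the byte counts it prints depend on the CPython version and platform.

```python
import sys

class DictNode(object):
    """Hypothetical node that keeps its attributes in a per-instance __dict__."""
    def __init__(self, char):
        self.char = char
        self.children = []

class SlotNode(object):
    """Hypothetical node with a fixed attribute layout and no per-instance __dict__."""
    __slots__ = ["char", "children"]

    def __init__(self, char):
        self.char = char
        self.children = []

d = DictNode("a")
s = SlotNode("a")

# sys.getsizeof is shallow, so the instance __dict__ has to be added by hand.
dict_node_bytes = sys.getsizeof(d) + sys.getsizeof(d.__dict__)
slot_node_bytes = sys.getsizeof(s)
print(dict_node_bytes, slot_node_bytes)
```

Neither figure includes the children list or the character string, so the real per-node cost is higher in both cases.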

When I serialise the tree with pickle (using the highest protocol), the resulting file is more than a hundred times smaller.

When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from? Is it something to do with Python’s handling of memory references to class instances?

Update

Thank you larsmans and Gurgeh for your very helpful explanations and advice. I’m using the tree as part of an information retrieval interface over a corpus of texts.

I originally stored the children (max of 30) as a NumPy array, then tried the hardware version (ctypes.py_object*30), the Python array (ArrayType), as well as the dict and set types.

Lists seemed to do better (using guppy to profile the memory, and __slots__ = ['variable', ...]), but I’m still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy for nodes with only one child, and I have quite a lot of them. 😉

After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but maybe I can do this as the tree is constructed. As construction time is not too important in my case, array.array() sounds like something that would be useful to try. Thanks for the tip, really appreciated.
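One way the array.array() idea could look is sketched below. It is only an illustration under assumptions: the flat nodes table, the add_node helper, and the choice to store children as integer indices rather than object references are all hypothetical and not part of the original code. An array.array grows on append, so it does not need to be sized in advance.

```python
from array import array

class Node(object):
    __slots__ = ["char", "children"]

    def __init__(self, char):
        self.char = char
        # Typed array of unsigned ints; each entry is an index into `nodes`.
        self.children = array("I")

nodes = []  # hypothetical flat table holding every node of the tree

def add_node(char, parent_index=None):
    """Append a node to the table and link it from its parent, if any."""
    nodes.append(Node(char))
    index = len(nodes) - 1
    if parent_index is not None:
        nodes[parent_index].children.append(index)
    return index

root = add_node("")
a = add_node("a", root)
b = add_node("b", root)
c = add_node("c", a)

child_chars = [nodes[i].char for i in nodes[root].children]
print(child_chars)
```

Each array still has its own object header, so whether this beats plain lists for nodes with a single child is something to measure with a profiler rather than assume.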

I’ll let you know how it goes.


1 Answer

Editorial Team · Answer 1 · 2026-05-22T11:30:12+00:00

If you try to pickle an empty list, you get:

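>>> from StringIO import StringIO  # Python 2 module
>>> import cPickle as pickle       # assumption: the C pickler, default protocol 0
>>> s = StringIO()
>>> pickle.dump([], s)
>>> s.getvalue()
'(l.'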

and similarly '(d.' for an empty dict. That’s three bytes. The in-memory representation of a list, however, contains:

- a reference count
- a type ID, in turn containing a pointer to the type name and bookkeeping info for memory allocation
- a pointer to a vector of pointers to actual elements
- and yet more bookkeeping info.

On my machine, which has 64-bit pointers, the size of a Python list header object is 40 bytes, so that’s one order of magnitude. I assume an empty dict will have a similar size.
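The comparison can be repeated on any machine with a short Python 3 sketch along these lines. The exact numbers vary with the CPython version, the platform, and the pickle protocol, so none are shown; the sizes dictionary is just a name chosen for this example.

```python
import pickle
import sys

sizes = {}
for obj in ([], {}):
    pickled = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    # (bytes on disk, shallow bytes in memory) for each empty container
    sizes[type(obj).__name__] = (len(pickled), sys.getsizeof(obj))

for name, (pickled_bytes, memory_bytes) in sizes.items():
    print(name, pickled_bytes, memory_bytes)
```

Note that sys.getsizeof reports only the container itself, not the objects it refers to, so for a populated tree the gap is measured per node and adds up across millions of nodes.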

Then both list and dict use an overallocation strategy to obtain amortized O(1) performance for their main operations. On top of that, malloc introduces overhead, there’s alignment, there are member attributes that you may or may not even be aware of, and various other factors get you the second order of magnitude.
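The overallocation can be observed directly in CPython with a small experiment like the following. It is a sketch that relies on sys.getsizeof reflecting the allocated capacity of a list, which holds for CPython but is an implementation detail, and the loop length of 64 is arbitrary.

```python
import sys

items = []
sizes = []
for i in range(64):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# The reported size changes only on some appends: each jump reserves
# room for several future elements, and the appends in between are free.
growth_steps = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur != prev)
print(growth_steps, "size changes over", len(items), "appends")
```

Between two jumps the list holds spare pointer slots that are allocated but unused, which is memory a pickle file never has to store.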

Summing up: pickle is a pretty good compression algorithm for Python objects 🙂
