The Archive Base

Editorial Team
Asked: May 10, 2026


In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
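For reference, the three-level structure described above could look roughly like the sketch below. It assumes `collections.defaultdict` and a pre-tokenized list of words; the function name `count_triplets` is made up for illustration and is not the original code.

```python
from collections import defaultdict


def count_triplets(words):
    """Count word triplets in a nested dict: top[w1][w2][w3] -> count."""
    # Three nested defaultdicts; the innermost one maps a word to an int count.
    top = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    # Slide a window of three consecutive words over the text.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        top[w1][w2][w3] += 1
    return top


if __name__ == '__main__':
    top_dict = count_triplets("a b c a b c a b d".split())
    print(top_dict['a']['b']['c'])      # count for one triplet
    print(sorted(top_dict['a']['b']))   # all third words seen after "a b"
```

Every distinct first word and every distinct pair gets its own dictionary object here, which is probably where much of the overhead comes from.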

This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.

My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.

From what I know of data structures, a balanced binary search tree such as a red-black or AVL tree would probably be ideal, but I would really prefer not to implement one myself. If possible, I’d prefer to stick with the standard Python libraries, but I’m definitely open to other alternatives if they would work best.

So, does anyone have any suggestions for me?

Edited to add:

Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn’t really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).

The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The result of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count, all tab separated. Storing the same data in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.

1 Answer

Added an answer on May 10, 2026 at 9:44 pm

    Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:

    295760     S. Lott's answer
    237984     S. Lott's with keys interned before passing in
    203172 [*] d[(a,b,c)] = int(freq)
    203156     d[a][b][c] = int(freq)
    189132     keys.append((a,b,c)); freqs.append(int(freq))
    146132     d[intern(a),intern(b)][intern(c)] = int(freq)
    145408     d[intern(a)][intern(b)][intern(c)] = int(freq)
     83888 [*] d[a+' '+b+' '+c] = int(freq)
     82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
     68756     keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
     60320     keys.append(a+' '+b+' '+c); freqs.append(int(freq))
     50556     pair array
     48320     squeezed pair array
     33024     squeezed single array

    The entries marked [*] have no efficient way to look up a pair (a,b); they’re listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
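As an illustration, the `d[intern(a),intern(b)][intern(c)]` row from the measurements, which does support lookup by pair, might be written as follows in Python 3, where `intern` lives in `sys`. The function name `load_pair_dict` and the whitespace-separated "a b c freq" line format are assumptions for this sketch.

```python
import sys
from collections import defaultdict


def load_pair_dict(lines):
    """Build d[(a, b)][c] = freq from lines of the form 'a b c freq'."""
    d = defaultdict(dict)
    for line in lines:
        a, b, c, freq = line.split()
        # Interning makes repeated words share a single string object.
        d[sys.intern(a), sys.intern(b)][sys.intern(c)] = int(freq)
    return d


if __name__ == '__main__':
    d = load_pair_dict(["the cat sat 2", "the cat ran 1"])
    print(sorted(d['the', 'cat'].keys()))  # third words following "the cat"
```

This keeps the `topDict[word1, word2].keys()` style of access the question asks for; how much memory it saves will depend on the data and the Python version.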

    ‘Pair array’ is the scheme below in my original answer ("I’d start with the array with keys being the first two words…"), where the value table for each pair is represented as a single string. ‘Squeezed pair array’ is the same, leaving out the frequency values that are equal to 1 (the most common case). ‘Squeezed single array’ is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:

    import collections

    def build(file):
        # Group the third words (with their counts) under each "a b" pair key.
        pairs = collections.defaultdict(list)
        for line in file:  # N.B. file assumed to be already sorted
            a, b, c, freq = line.split()
            key = ' '.join((a, b))
            # Leave out the count when it is 1, the most common case.
            pairs[key].append(c + ':' + freq if freq != '1' else c)
        # Close the output file explicitly so every line is flushed to disk.
        with open('squeezedsinglearrayfile', 'w') as out:
            for key in sorted(pairs.keys()):
                out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

    def load():
        with open('squeezedsinglearrayfile') as f:
            return f.readlines()

    if __name__ == '__main__':
        with open('freqs') as f:
            build(f)

    I haven’t written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
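A possible lookup over the list returned by `load()` is sketched below. It is not part of the measured code. It uses a hand-written binary search equivalent to `bisect.bisect_left` on the key part of each line, because the `key=` argument of `bisect` is, as far as I know, only available from Python 3.10. Comparing whole lines would not be safe: '|' sorts after ASCII letters, so the order of the full lines can differ from the order of the sorted keys. The names `parse_entry` and `lookup` are made up, and words are assumed to contain neither '|' nor ':'.

```python
def parse_entry(line):
    """Split 'a b|c1 c2:3' into ('a b', {'c1': 1, 'c2': 3})."""
    key, _, rest = line.rstrip('\n').partition('|')
    counts = {}
    for item in rest.split():
        word, sep, freq = item.rpartition(':')
        if sep:
            counts[word] = int(freq)
        else:
            counts[item] = 1  # a count of 1 was squeezed out when writing
    return key, counts


def lookup(lines, a, b):
    """Return {third_word: freq} for the pair (a, b), or {} if absent."""
    target = a + ' ' + b
    lo, hi = 0, len(lines)
    # Binary search for the leftmost line whose key is >= target.
    while lo < hi:
        mid = (lo + hi) // 2
        if lines[mid].partition('|')[0] < target:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(lines):
        key, counts = parse_entry(lines[lo])
        if key == target:
            return counts
    return {}


if __name__ == '__main__':
    lines = ["a b|c d:2\n", "a bc|e\n", "x y|z:5\n"]
    print(lookup(lines, 'a', 'b'))
```

Each lookup parses only the one matching line, so the structure stays a flat list of strings in memory.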

    Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there’s a standard trick to strip out common prefixes, with another level of index to get them back, but that’s rather more complex and slower.

    The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are ‘hello george’ and ‘hello world’, make the second entry be ‘6world’ instead (6 being the length of the prefix in common). Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.

    So specifically, I’d start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though; I think you may be out of luck as far as batteries-included memory-efficient options.
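The prefix-stripping trick mentioned above can be sketched as below. This is only the encoding step: the chunking and the random-access index are left out. Entries are stored as `(prefix_length, suffix)` tuples rather than strings like ‘6world’, to avoid ambiguity when a word starts with a digit. The function names are made up for illustration.

```python
import os


def front_encode(sorted_entries):
    """Replace the prefix shared with the previous entry by its length."""
    encoded = []
    prev = ''
    for entry in sorted_entries:
        # os.path.commonprefix compares character by character.
        n = len(os.path.commonprefix([prev, entry]))
        encoded.append((n, entry[n:]))
        prev = entry
    return encoded


def front_decode(encoded):
    """Rebuild the original entries by scanning sequentially."""
    entries = []
    prev = ''
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        entries.append(prev)
    return entries


if __name__ == '__main__':
    entries = ['hello george', 'hello world']
    packed = front_encode(entries)
    print(packed)
    assert front_decode(packed) == entries
```

Decoding has to start from the beginning of a run, which is why the full scheme splits the array into chunks with an index to the start of each one.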

    Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.

    I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your ‘real’ data structure, whatever it is, you use integer keys instead of strings — indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn’t suggest this after all.)
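A rough sketch of that wordlist idea: keep one sorted list of distinct words, find a word's index with `bisect`, and key the real structure on those integers. The helper names and the choice of a pair-keyed dict as the ‘real’ structure are assumptions for illustration; whether this actually saves memory would need measuring.

```python
import bisect


def build_wordlist(trigrams):
    """Sorted list of every distinct word appearing in the trigrams."""
    return sorted({word for trigram in trigrams for word in trigram})


def word_id(wordlist, word):
    """Index of word in the sorted wordlist, found by binary search."""
    i = bisect.bisect_left(wordlist, word)
    if i < len(wordlist) and wordlist[i] == word:
        return i
    raise KeyError(word)


def encode(wordlist, trigram_counts):
    """Map {(a, b, c): freq} to {(id_a, id_b): {id_c: freq}}."""
    table = {}
    for (a, b, c), freq in trigram_counts.items():
        pair = (word_id(wordlist, a), word_id(wordlist, b))
        table.setdefault(pair, {})[word_id(wordlist, c)] = freq
    return table


if __name__ == '__main__':
    counts = {('the', 'cat', 'sat'): 2, ('the', 'cat', 'ran'): 1}
    words = build_wordlist(counts)
    table = encode(words, counts)
    pair = (word_id(words, 'the'), word_id(words, 'cat'))
    # Translate the integer ids back to words for display.
    print({words[i]: freq for i, freq in table[pair].items()})
```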
