
The Archive Base


Editorial Team
Asked: May 10, 2026


In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
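For reference, the three-level default-dictionary layout described above can be sketched like this (a minimal reconstruction with toy data; the question doesn't include the actual code):

```python
import collections

# Three levels deep: top_dict[word1][word2][word3] -> count
top_dict = collections.defaultdict(
    lambda: collections.defaultdict(collections.Counter))

words = "the quick brown the quick brown".split()
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    top_dict[w1][w2][w3] += 1

print(top_dict['the']['quick']['brown'])      # count of the triplet: 2
print(list(top_dict['the']['quick'].keys()))  # words following 'the quick': ['brown']
```

Each level allocates a fresh hash table, which is where the memory overhead the question describes comes from.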

This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.

My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.

From what I know of data structures, a balanced binary search tree such as a red-black or AVL tree would probably be ideal, but I would really prefer not to implement one myself. If possible, I'd prefer to stick with standard Python libraries, but I'm definitely open to other alternatives if they would work best.

So, does anyone have any suggestions for me?

Edited to add:

Thanks for the responses so far. A few of the answers have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key, since I want it to be easy to look up all third words given the first two (i.e. I want something like the result of topDict[word1, word2].keys()).
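Concretely, the pair-as-key layout being described here looks like this (a sketch with made-up data, not the question's actual code):

```python
import collections

# Key on the (word1, word2) pair; each value maps third words to counts.
top_dict = collections.defaultdict(collections.Counter)

words = "the quick brown the quick fox".split()
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    top_dict[w1, w2][w3] += 1

# All third words following a given pair, as in topDict[word1, word2].keys():
print(sorted(top_dict['the', 'quick']))  # ['brown', 'fox']
```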

The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The result of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count, all tab-separated. Storing the same data in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.


1 Answer

  Answered on May 10, 2026 at 9:44 pm

    Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:

    295760     S. Lott's answer
    237984     S. Lott's with keys interned before passing in
    203172 [*] d[(a,b,c)] = int(freq)
    203156     d[a][b][c] = int(freq)
    189132     keys.append((a,b,c)); freqs.append(int(freq))
    146132     d[intern(a),intern(b)][intern(c)] = int(freq)
    145408     d[intern(a)][intern(b)][intern(c)] = int(freq)
     83888 [*] d[a+' '+b+' '+c] = int(freq)
     82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
     68756     keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
     60320     keys.append(a+' '+b+' '+c); freqs.append(int(freq))
     50556     pair array
     48320     squeezed pair array
     33024     squeezed single array

    The entries marked [*] have no efficient way to look up a pair (a,b); they’re listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)

    ‘Pair array’ is the scheme below in my original answer ("I’d start with the array with keys being the first two words…"), where the value table for each pair is represented as a single string. ‘Squeezed pair array’ is the same, leaving out the frequency values that are equal to 1 (the most common case). ‘Squeezed single array’ is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:

    import collections

    def build(file):
        pairs = collections.defaultdict(list)
        for line in file:  # N.B. file assumed to be already sorted
            a, b, c, freq = line.split()
            key = ' '.join((a, b))
            pairs[key].append(c + ':' + freq if freq != '1' else c)
        out = open('squeezedsinglearrayfile', 'w')
        for key in sorted(pairs.keys()):
            out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

    def load():
        return open('squeezedsinglearrayfile').readlines()

    if __name__ == '__main__':
        build(open('freqs'))

    I haven’t written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
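For what such a lookup might look like, here is a sketch against the 'key|values' line format produced by build() above (my own code, not the answer author's; note that the parallel key list costs extra memory, so a real version would want to avoid materializing it):

```python
import bisect

def make_index(lines):
    """Extract the pair keys once so bisect can search them."""
    return [line.split('|', 1)[0] for line in lines]

def lookup(lines, keys, a, b):
    """Return {third_word: freq} for the pair (a, b), or {} if absent."""
    key = a + ' ' + b
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return {}
    result = {}
    for entry in lines[i].rstrip('\n').split('|', 1)[1].split():
        word, _, freq = entry.partition(':')
        result[word] = int(freq) if freq else 1  # missing ':freq' means count 1
    return result

lines = ['hello george|world:3 there\n', 'hello world|again\n']
keys = make_index(lines)
print(lookup(lines, keys, 'hello', 'george'))  # {'world': 3, 'there': 1}
```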

    Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there’s a standard trick to strip out common prefixes, with another level of index to get them back, but that’s rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are ‘hello george’ and ‘hello world’, make the second entry ‘6world’ instead, 6 being the length of the common prefix. Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.)

    So specifically, I’d start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though; I think you may be out of luck as far as batteries-included memory-efficient options go.
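The 'hello george' / '6world' prefix trick is front coding; a minimal sketch (hypothetical helper names, assuming the differing suffixes don't themselves begin with digits, which holds for alphabetic words):

```python
def front_encode(sorted_strings):
    """Replace each entry's prefix shared with the previous entry by its length."""
    out, prev = [], ''
    for s in sorted_strings:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1
        out.append('%d%s' % (n, s[n:]))
        prev = s
    return out

def front_decode(encoded):
    """Rebuild the original strings from their front-coded form."""
    out, prev = [], ''
    for e in encoded:
        i = 0
        while e[i].isdigit():
            i += 1
        s = prev[:int(e[:i])] + e[i:]
        out.append(s)
        prev = s
    return out

words = ['hello george', 'hello world']
print(front_encode(words))  # ['0hello george', '6world']
```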

    Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.

    I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your ‘real’ data structure, whatever it is, you use integer keys instead of strings — indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn’t suggest this after all.)
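The wordlist-with-integer-keys idea looks roughly like this (a sketch; the names and the toy counts are mine):

```python
import bisect

wordlist = sorted({'brown', 'fox', 'quick', 'the'})  # sorted unique words

def word_id(w):
    """Map a word to its index in the sorted wordlist (assumes w is present)."""
    i = bisect.bisect_left(wordlist, w)
    assert wordlist[i] == w
    return i

# Trigram counts keyed by small integer triples instead of three strings:
counts = {(word_id('the'), word_id('quick'), word_id('brown')): 7}
print(counts[word_id('the'), word_id('quick'), word_id('brown')])  # 7
```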


