I have a tough problem right now. I have a big dictionary file that my program needs to load. Its format is basically:
word1 val1
word2 val2
word3 val3
...
...
This file has 170k lines and takes 3.9 MB on disk (in plain text). In my implementation I used boost::unordered_map (a hash table) to store this data and support read-only lookups in my program.
Yet after loading it at runtime, memory usage increased by 20 MB (I checked this via the Private Working Set size in Windows Task Manager; maybe this is not the right way to determine memory usage?). I know the hash table needs some auxiliary data structures to store the data, which will increase memory usage, but I didn’t expect the in-memory size to be about 5x the size on disk!
Is this normal? I have tried another hash map from the std extension library, and a trie structure as well, but neither of them brought a significant improvement.
So I want to optimize the space usage here. Could anyone give me some tips or keywords to help me reduce the memory footprint?
A hash map data structure allocates much more memory than it is using at any one time. This is to facilitate quick inserts and removals. When a hash table reaches a certain fill level (implementation-defined, but typically something like 50%, 70%, or 90% full), it will allocate more memory and copy everything over. The point is that it allocates more memory than is in use.
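To see the growth behaviour described above for yourself, here is a minimal sketch. It uses std::unordered_map as a stand-in for boost::unordered_map (the bucket interface is very similar, but exact growth policies differ between implementations). The TableStats struct, the fill_and_measure function, and the synthetic "wordN" keys are all made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Snapshot of a hash table's bucket bookkeeping (hypothetical helper type).
struct TableStats {
    std::size_t size;        // number of stored entries
    std::size_t buckets;     // number of buckets allocated
    std::size_t rehashes;    // how many times the bucket array was rebuilt
    float load_factor;       // size / buckets
    float max_load_factor;   // threshold that triggers a rehash
};

// Insert n synthetic entries and record how the table grew.
// If reserve_first is true, room for n entries is requested up front.
inline TableStats fill_and_measure(std::size_t n, bool reserve_first) {
    std::unordered_map<std::string, int> table;
    if (reserve_first) {
        table.reserve(n);  // size the bucket array once, before inserting
    }

    std::size_t rehashes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = table.bucket_count();
        table.emplace("word" + std::to_string(i), static_cast<int>(i));
        if (table.bucket_count() != before) {
            ++rehashes;  // the bucket array was reallocated during this insert
        }
    }
    assert(table.size() == n);  // keys are unique, so every insert succeeds

    return TableStats{table.size(), table.bucket_count(), rehashes,
                      table.load_factor(), table.max_load_factor()};
}
```

Calling fill_and_measure(170000, false) and printing the fields shows how many times the bucket array was rebuilt. With reserve_first set to true the table is sized once up front, which avoids the intermediate reallocations. Note that the buckets are only part of the cost: in the common node-based implementations every entry is also a separate heap allocation with its own overhead.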
Also, the 20 MB that you see the program using is the size of all the memory your program is using, not just the one hash map.
Furthermore, if you are using std::string or an equivalent structure to store the value, you have already created a copy of half the data you are getting from the file. You’ll have one copy in the buffer you read the file into, and then another copy in the strings in the hash table.
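One way to avoid holding the whole file in a buffer next to the table is to parse it token by token from a stream and move each string into the map. The following is only a sketch under some assumptions: values are stored as std::string, neither words nor values contain whitespace, and the names Dictionary and load_dictionary are invented for this example.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// Assumption: both the word and its value are kept as strings.
using Dictionary = std::unordered_map<std::string, std::string>;

// Read "word value" pairs from a stream one pair at a time, so only the
// current pair is buffered in addition to the stream's own internal buffer.
inline Dictionary load_dictionary(std::istream& in,
                                  std::size_t expected_entries = 0) {
    Dictionary dict;
    if (expected_entries > 0) {
        dict.reserve(expected_entries);  // avoid repeated rehashing while loading
    }

    std::string word;
    std::string value;
    while (in >> word >> value) {
        // Move the strings into the table instead of copying them;
        // the next extraction overwrites word and value anyway.
        dict.emplace(std::move(word), std::move(value));
    }
    return dict;
}
```

For the real file you would pass a std::ifstream opened on the dictionary and something like 170000 as expected_entries. This does not remove the per-entry overhead of the strings and hash nodes themselves; it only avoids keeping a second full copy of the file contents around.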