I have a large list of strings stored in one huge memory block (usually

Question


Asked: June 18, 2026 (2026-06-18T15:16:02+00:00)


I have a large list of strings stored in one huge memory block (usually there are 100k+ or even 1M+ of them). These are actually hashes, so the alphabet of the strings is limited to A-F0-9 and each string is exactly 32 bytes long (so it's stored ‘compressed’). I will call this list the main list from now on.

I want to be able to remove items from the main list. This will usually be done in bulk: I get a large list of hashes (usually about 100 to 10k) which I need to find in the main list and remove. At the end of this operation there cannot be any empty blocks in the large memory block, so I need to take that into account. It is not guaranteed that all of the items will be in the main list, but none will be there multiple times. No reallocation can be done; the main block will always stay the same size.

The naive approach of iterating through the main list and checking whether each hash should be removed works, of course, but it is a bit slow. There is also too much moving of small memory blocks: every time a hash is flagged for removal, I overwrite it with the last element of the main list, which satisfies the condition of no empty blocks. This creates thousands of small memcpy’s, which slow things down further because I get tons of cache misses.

Is there a better approach?

Some important notes:

- the main list is not sorted and I cannot waste time sorting it; this is a limitation imposed by the whole project, and rewriting it so the list is always sorted is not an option (it might not even be possible)
- memory is not really a problem, but the less is used the better
- I can use STL, but not boost


1 Answer

Editorial Team · Answer 1 · 2026-06-18T15:16:04+00:00

Okay, here’s what I’d do if I absolutely had to optimize the hell out of this.
I’m assuming order doesn’t matter, which seems to be the case as you (IIUC) remove items by swapping them with the last item.

Store 128-bit integers (however you represent them: either your compiler supports them natively or you use a small array of 32/64-bit integers) instead of 32-char strings. See my comment on the question.
Roll my own hash set of 128-bit integers. Note that you can optimize a lot here if you’re willing to think a bit, make some assumptions, and get down ‘n dirty. Some notes:
- You only need to store the hashes themselves (for collision resolution), and a bit or two of metadata to identify deleted/unused slots. Have a look at what existing hash tables do if you’re unsure how to guarantee correctness. I figure it’s even simpler if you only ever delete (not add) after building the hash set. I think you could even do without that metadata if you had a value that’s not a valid hash to denote empty slots, but with the metadata removal is easier (just flip a bit, instead of overwriting 128 bits).
- You don’t need a hash function, as your inputs are already integers. You just need to do what every hash table does anyway: take the hashes modulo 2^n to derive an index that’s not freaking huge. Choose n such that the load factor (the fraction of table entries used) is reasonable (< 2/3 seems standard). Choosing a power of two makes the modulo operation cheaper (masking off bits via binary AND), and allows you to just do it on the lower 32 or 64 bits (ignoring the rest).
- Choosing a collision resolution strategy is hard. I’d probably go with open addressing with linear probing as a first attempt. It may work badly, but if your input hashes are any good, this seems unlikely. There’s also a probing scheme that factors in more and more of the bits you originally cut off, used by CPython’s dict.
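To make the notes above a bit more concrete, here is a minimal sketch of such a hash set, not a tuned implementation. The names Key128 and HashSet128 are made up for illustration, the 128-bit value is assumed to be held as two 64-bit halves, and one metadata byte per slot (rather than a packed bit or two) is used to keep the code short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A 128-bit hash held as two 64-bit halves (hypothetical name).
struct Key128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

inline bool operator==(const Key128& a, const Key128& b) {
    return a.lo == b.lo && a.hi == b.hi;
}

// Open addressing with linear probing over a table of 2^n slots.
class HashSet128 {
public:
    explicit HashSet128(unsigned capacity_log2)
        : mask_(slots(capacity_log2) - 1),
          keys_(mask_ + 1),
          state_(mask_ + 1, kEmpty) {}

    // Returns false if the key is already present.
    bool insert(const Key128& k) {
        std::size_t i = index(k);
        std::size_t target = npos;  // first reusable slot seen while probing
        for (std::size_t n = 0; n <= mask_; ++n, i = (i + 1) & mask_) {
            if (state_[i] == kEmpty) {
                if (target == npos) target = i;
                break;  // an empty slot ends the probe sequence
            }
            if (state_[i] == kDeleted) {
                if (target == npos) target = i;
            } else if (keys_[i] == k) {
                return false;
            }
        }
        assert(target != npos && "table is full; pick a larger capacity_log2");
        keys_[target] = k;
        state_[target] = kUsed;
        ++size_;
        return true;
    }

    bool contains(const Key128& k) const { return find(k) != npos; }

    // Removal only flips the slot's metadata; the key bytes stay in place.
    bool erase(const Key128& k) {
        const std::size_t i = find(k);
        if (i == npos) return false;
        state_[i] = kDeleted;
        --size_;
        return true;
    }

    std::size_t size() const { return size_; }

private:
    enum : std::uint8_t { kEmpty, kUsed, kDeleted };
    static constexpr std::size_t npos = ~std::size_t(0);

    static std::size_t slots(unsigned n) {
        assert(n > 0 && n < 32);
        return std::size_t(1) << n;
    }

    // No hash function: mask the low bits of the (already hashed) value.
    std::size_t index(const Key128& k) const {
        return static_cast<std::size_t>(k.lo & mask_);
    }

    std::size_t find(const Key128& k) const {
        std::size_t i = index(k);
        for (std::size_t n = 0; n <= mask_; ++n, i = (i + 1) & mask_) {
            if (state_[i] == kEmpty) return npos;  // key cannot be further on
            if (state_[i] == kUsed && keys_[i] == k) return i;
        }
        return npos;
    }

    std::size_t mask_;
    std::vector<Key128> keys_;
    std::vector<std::uint8_t> state_;
    std::size_t size_ = 0;
};
```

For 1M hashes and a load factor below 2/3, capacity_log2 = 21 (about 2M slots) would be a reasonable starting point. Deleted slots are kept as tombstones so that keys stored further along the same probe sequence stay reachable.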

Now, this is a lot more work and maintenance burden than using off-the-shelf solutions. I wouldn’t advise it unless this really is as performance-critical as it sounds in your description.
If C++11 is an option, and your compiler’s unordered_set is any good, maybe you should just use it and save yourself most of the hassle (but be aware that this probably increases memory requirements). You still need to specialize std::hash and provide equality (operator==, or a specialization of std::equal_to). Alternatively, supply your own Hash and KeyEqual template arguments for unordered_set, but that probably doesn’t offer any benefit.
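A rough sketch of the off-the-shelf route, assuming C++11 and assuming the main list can be kept in a std::unordered_set rather than in the raw block. Key128, Key128Hash, parse_hex32 and remove_bulk are illustrative names, not anything from the question. The sketch passes a Hash functor as a template argument, which is a little shorter than specializing std::hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// A 128-bit hash held as two 64-bit halves (hypothetical name).
struct Key128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

inline bool operator==(const Key128& a, const Key128& b) {
    return a.lo == b.lo && a.hi == b.hi;
}

// The value is already a hash, so reuse its low bits as the bucket hash.
struct Key128Hash {
    std::size_t operator()(const Key128& k) const noexcept {
        return static_cast<std::size_t>(k.lo);
    }
};

typedef std::unordered_set<Key128, Key128Hash> HashSet;

// Parses exactly 32 hex digits (upper or lower case) into a Key128.
// Returns false, leaving `out` untouched, on any other character.
inline bool parse_hex32(const char* s, Key128& out) {
    assert(s != nullptr);
    std::uint64_t half[2] = {0, 0};
    for (int i = 0; i < 32; ++i) {
        const char c = s[i];
        unsigned v;
        if (c >= '0' && c <= '9') v = static_cast<unsigned>(c - '0');
        else if (c >= 'A' && c <= 'F') v = static_cast<unsigned>(c - 'A') + 10;
        else if (c >= 'a' && c <= 'f') v = static_cast<unsigned>(c - 'a') + 10;
        else return false;
        half[i / 16] = (half[i / 16] << 4) | v;
    }
    out.hi = half[0];  // first 16 digits
    out.lo = half[1];  // last 16 digits
    return true;
}

// Erases every hash in `to_remove` that is present; returns how many were.
inline std::size_t remove_bulk(HashSet& main_set,
                               const std::vector<Key128>& to_remove) {
    std::size_t removed = 0;
    for (const Key128& k : to_remove) {
        removed += main_set.erase(k);  // erase(key) returns 0 or 1
    }
    return removed;
}
```

Hashes that are not in the set are simply skipped, which matches the requirement that not every item in the removal list has to exist in the main list.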
