The Archive Base

Editorial Team
Asked: June 18, 2026

I have a large list of strings stored in one huge memory block (usually there are 100k+ or even 1M+ of them). These are actually hashes, so the alphabet of the strings is limited to A-F0-9 and each string is exactly 32 bytes long (so it's stored 'compressed'). I will call this list the main list from now on.

I want to be able to remove items from the main list. This will usually be done in bulk: I get a large list (usually about 100 to 10k) of hashes which I need to find in the main list and remove. At the end of this operation there cannot be any empty blocks in the large memory block, so I need to take that into account. It is not guaranteed that all of the items will be in the main list, but none will be there multiple times. No reallocation can be done; the main block will always stay the same size.

The naive approach of iterating through the main list and checking whether a given hash should be removed works, of course, but is a bit slow. There is also a bit too much moving of small memory blocks, because every time a hash is flagged for removal I overwrite it with the last element of the main list, thus satisfying the condition of no empty blocks. This creates thousands of small memcpy's, which slow things down further because I get tons of cache misses.

Is there a better approach?

Some important notes:

  • the main list is not sorted and I cannot waste time sorting it; this
    is a limitation imposed by the whole project, and rewriting it so the
    list is always sorted is not an option (it might not even be
    possible)
  • memory is not really a problem, but the less is used the better
  • I can use STL, but not boost

1 Answer

  1. Editorial Team
    Added an answer on June 18, 2026 at 3:16 pm

    Okay, here’s what I’d do if I absolutely had to optimize the hell out of this.
    I’m assuming order doesn’t matter, which seems to be the case as you (IIUC) remove items by swapping them with the last item.

    • Store 128 bit integers (however you represent them, either your compiler supports them natively or you use a small array of 32/64 bit integers) instead of 32-char strings. See my comment on the question.
    • Roll my own hash set of 128 bit integers. Note that you can optimize a lot here if you're willing to think a bit, make some assumptions, and get down 'n dirty. Some notes:
      • You only need to store the hashes themselves (for collision resolution), and a bit or two of metadata to identify deleted/unused slots. Have a look at what existing hash tables do if you’re unsure how to guarantee correctness. I figure it’s even simpler if you only ever delete (not add) after building the hash set. Though I think you could even do without that metadata if you had a value that’s not a valid hash to denote empty slots, but this way removal is easier (just flip a bit, instead of overwriting 128 bit).
      • You don't need a hash function, as your inputs are already integers. You just need to do what every hash table does anyway: take the hashes modulo 2^n to derive an index that's not freaking huge. Choose n such that the load factor (the fraction of table entries used) is reasonable (< 2/3 seems standard). Choosing a power of two makes the modulo operation cheaper (masking off bits via binary AND), and allows you to just do it on the lower 32 or 64 bits (ignoring the rest).
      • Choosing a collision resolution strategy is hard. I'd probably go with open addressing with linear probing as a first attempt. It may work badly, but if your input hashes are any good, this seems unlikely. There's also a probing scheme that factors in more and more of the bits you originally cut off, used by CPython's dict.
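    To make the notes above concrete, here is a minimal sketch of such a set, not a tuned implementation. The names Key128, parse_key and HashSet128 are made up for illustration. It assumes a 128-bit value held as two 64-bit halves, a table with a power-of-two number of slots indexed by masking the low bits, linear probing, and one byte of per-slot metadata (a real version could pack that into a bit or two). It also assumes the set is built first and only erased from afterwards, so tombstones are never reused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 128-bit key: two 64-bit halves.
struct Key128 {
    std::uint64_t hi;
    std::uint64_t lo;
};

inline bool operator==(const Key128& a, const Key128& b) {
    return a.hi == b.hi && a.lo == b.lo;
}

// Value of one hex digit, or -1 if c is not a hex digit.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Packs exactly 32 hex characters into a Key128; false on a bad digit.
inline bool parse_key(const char* s, Key128& out) {
    std::uint64_t half[2] = {0, 0};
    for (int i = 0; i < 32; ++i) {
        const int v = hex_value(s[i]);
        if (v < 0) return false;
        half[i / 16] = (half[i / 16] << 4) | static_cast<std::uint64_t>(v);
    }
    out.hi = half[0];
    out.lo = half[1];
    return true;
}

class HashSet128 {
public:
    // The table has 2^log2_slots slots; pick it so the load stays below ~2/3.
    explicit HashSet128(unsigned log2_slots)
        : slots_(std::size_t(1) << log2_slots),
          state_(slots_.size(), kEmpty),
          mask_(slots_.size() - 1),
          used_(0) {
        assert(log2_slots < sizeof(std::size_t) * 8);
    }

    // True if k was newly added; false if already present or the table is full.
    bool insert(const Key128& k) {
        std::size_t i = start(k);
        while (state_[i] != kEmpty) {
            if (state_[i] == kFull && slots_[i] == k) return false;
            i = (i + 1) & mask_;  // linear probing, wrapping around
        }
        // Always keep one empty slot so every probe sequence terminates.
        if (used_ + 1 >= slots_.size()) return false;
        slots_[i] = k;
        state_[i] = kFull;
        ++used_;
        return true;
    }

    bool contains(const Key128& k) const {
        std::size_t i;
        return locate(k, i);
    }

    // Marks the slot as deleted (a tombstone) instead of overwriting 128 bits.
    bool erase(const Key128& k) {
        std::size_t i;
        if (!locate(k, i)) return false;
        state_[i] = kDeleted;
        return true;
    }

private:
    enum : unsigned char { kEmpty, kFull, kDeleted };

    // No hash function: the key is already a hash, so mask its low bits.
    std::size_t start(const Key128& k) const {
        return static_cast<std::size_t>(k.lo) & mask_;
    }

    bool locate(const Key128& k, std::size_t& i) const {
        for (i = start(k); state_[i] != kEmpty; i = (i + 1) & mask_) {
            if (state_[i] == kFull && slots_[i] == k) return true;
        }
        return false;
    }

    std::vector<Key128> slots_;
    std::vector<unsigned char> state_;
    std::size_t mask_;
    std::size_t used_;  // slots that are full or tombstoned
};
```

    For example, a removal batch of 10k hashes would fit comfortably in HashSet128(14) (16384 slots, load around 0.61). Whether linear probing is good enough depends on how well distributed the low bits of the real hashes are, so measure before committing to it.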

    Now, this is a lot more work and maintenance burden than using off-the-shelf solutions. I wouldn’t advise it unless this really is as performance-critical as it sounds in your description.
    If C++11 is an option, and your compiler's unordered_set is any good, maybe you should just use it and save yourself most of the hassle (but be aware that this probably increases memory requirements). You still need to specialize std::hash and provide std::equal_to or operator==. Alternatively, supply your own Hash and KeyEqual for unordered_set, but that probably doesn't offer any benefit.
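
    A rough sketch of the off-the-shelf route follows, under a few assumptions that are not in the question: the main block already holds 16-byte Key128 records (a made-up type, repeated here so the block stands alone) rather than 32-char strings, the hashes to remove have been loaded into a std::unordered_set, and order does not matter. remove_all is a hypothetical helper. It makes a single pass over the main block, does an average O(1) lookup per record, and fills each hole with the current last record, so the block stays contiguous and is never reallocated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

// Hypothetical 128-bit key: two 64-bit halves.
struct Key128 {
    std::uint64_t hi;
    std::uint64_t lo;
};

inline bool operator==(const Key128& a, const Key128& b) {
    return a.hi == b.hi && a.lo == b.lo;
}

namespace std {
// The key is already a hash, so the low bits are passed straight through.
template <>
struct hash<Key128> {
    size_t operator()(const Key128& k) const noexcept {
        return static_cast<size_t>(k.lo);
    }
};
}  // namespace std

// Removes every record in block[0..count) that appears in `doomed`.
// Returns the new record count; survivors stay packed at the front of the
// block, but their order is not preserved.
inline std::size_t remove_all(Key128* block, std::size_t count,
                              const std::unordered_set<Key128>& doomed) {
    assert(block != nullptr || count == 0);
    std::size_t i = 0;
    while (i < count) {
        if (doomed.count(block[i]) != 0) {
            // Fill the hole with the last live record, then re-check slot i,
            // because the record moved in may itself be marked for removal.
            block[i] = block[--count];
        } else {
            ++i;
        }
    }
    return count;
}
```

    The set holds only the 100 to 10k hashes being removed, so the extra memory is small next to the main list, and hashes that are not in the main list are simply never matched.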
