Editorial Team
Asked: June 4, 2026

I use a custom heap implementation in one of my projects. It consists of two major parts:

  1. Fixed-size block heap, i.e. a heap that allocates blocks of one specific size only. It obtains larger memory blocks (either virtual memory pages or blocks from another heap) and then divides them into atomic allocation units.

    It performs allocation and freeing fast (in O(1)), and there’s no memory usage overhead apart from whatever the external heap imposes.

  2. Global general-purpose heap. It consists of buckets of the above (fixed-size) heaps. Based on the requested allocation size, it chooses the appropriate bucket and performs the allocation through it.

    Since the whole application is (heavily) multi-threaded, the global heap locks the appropriate bucket during each operation.

    Note: in contrast to traditional heaps, this heap requires the allocation size not only for allocation but also for freeing. This makes it possible to identify the appropriate bucket without searches or extra memory overhead (such as storing the block size in front of the allocated block). Though somewhat less convenient, this is OK in my case. Moreover, since the “bucket configuration” is known at compile time (implemented via C++ template voodoo), the appropriate bucket is determined at compile time.
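To make the two parts concrete, here is a minimal sketch of how such a design could look. It is not the original implementation: the names `FixedBlockHeap`, `bucket_size_for`, `bucket_heap`, `sized_alloc` and `sized_free`, the 64 KB slab size and the 16/32/64/128 byte bucket table are all assumptions made for illustration, and the per-bucket locking is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hypothetical fixed-size block heap. Slabs come from operator new[] here,
// standing in for "virtual memory pages or another heap".
template <std::size_t BlockSize, std::size_t SlabSize = 64 * 1024>
class FixedBlockHeap {
    struct FreeNode { FreeNode* next; };
    static_assert(BlockSize >= sizeof(FreeNode), "a free block must hold a link");
    static_assert(BlockSize % alignof(FreeNode) == 0, "links must stay aligned");
    static_assert(SlabSize >= BlockSize, "a slab must hold at least one block");

    FreeNode* free_list_ = nullptr;               // intrusive list of free blocks
    std::vector<std::unique_ptr<char[]>> slabs_;  // owns the slabs

    void refill() {
        slabs_.push_back(std::make_unique<char[]>(SlabSize));
        char* slab = slabs_.back().get();
        // Carve the slab into blocks and push each one onto the free list.
        for (std::size_t off = 0; off + BlockSize <= SlabSize; off += BlockSize)
            free_list_ = new (slab + off) FreeNode{free_list_};
    }

public:
    void* allocate() {            // O(1) except when a new slab is needed
        if (!free_list_) refill();
        FreeNode* node = free_list_;
        free_list_ = node->next;
        return node;
    }
    void deallocate(void* p) {    // O(1): push the block back onto the list
        free_list_ = new (p) FreeNode{free_list_};
    }
};

// Hypothetical bucket table: 16, 32, 64 and 128 byte blocks.
constexpr std::size_t kMaxSmall = 128;
constexpr std::size_t bucket_size_for(std::size_t n) {
    return n <= 16 ? 16 : n <= 32 ? 32 : n <= 64 ? 64 : kMaxSmall;
}

// One heap instance per bucket size (not per requested size).
template <std::size_t BlockSize>
FixedBlockHeap<BlockSize>& bucket_heap() {
    static FixedBlockHeap<BlockSize> heap;
    return heap;
}

// The size is a template argument for both calls, so the bucket is chosen
// at compile time. Locking is omitted in this sketch.
template <std::size_t Size>
void* sized_alloc() {
    static_assert(Size <= kMaxSmall, "too large for the small-object buckets");
    return bucket_heap<bucket_size_for(Size)>().allocate();
}

template <std::size_t Size>
void sized_free(void* p) {
    static_assert(Size <= kMaxSmall, "too large for the small-object buckets");
    bucket_heap<bucket_size_for(Size)>().deallocate(p);
}
```

In this sketch `sized_alloc<20>()` and `sized_free<24>(p)` resolve to the same 32-byte bucket without any run-time lookup, which is why the caller has to supply the size on both sides.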

So far everything looks (and works) good.

Recently I worked on an algorithm that performs heap operations heavily and is, naturally, significantly affected by heap performance. Profiling revealed that its performance is considerably impacted by the locking. That is, the heap itself works very fast (a typical allocation involves just a few memory dereferences), but since the whole application is multi-threaded, the appropriate bucket is protected by a critical section, which relies on interlocked instructions, and those are much heavier.

I’ve fixed this for now by giving this algorithm its own dedicated heap, which is not protected by a critical section. But this imposes several problems and restrictions at the code level, such as the need to pass context information deep down the call stack to wherever the heap may be needed. One could use TLS to avoid this, but that may cause problems with re-entrance in my specific case.

This makes me wonder: is there a known technique to optimize a heap for (but not limit it to) single-threaded usage?

EDIT:

Special thanks to @Voo for suggesting that I check out Google’s tcmalloc.

It seems to work more or less like what I did (at least for small objects). But in addition, it solves the exact issue I have by maintaining per-thread caches.

I had been thinking in this direction too, but in terms of per-thread heaps. With those, freeing a memory block that was allocated from a heap belonging to another thread is somewhat tricky: the block has to be inserted into some sort of locked queue, and the owning thread has to be notified and free the pending blocks asynchronously. Asynchronous deallocation may cause problems: if that thread is busy for some reason (for instance, it is performing a long calculation), no memory is actually deallocated. Plus, in a multi-threaded scenario the cost of deallocation is significantly higher.

OTOH the idea with caching seems much simpler, and more efficient. I’ll try to work it out.
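A minimal sketch of the caching idea, under stated assumptions: it is neither tcmalloc's actual code nor the heap described above. `CentralBucket`, `ThreadCache`, `cached_alloc`, `cached_free` and all the constants are hypothetical names and values. The point it illustrates is that a block freed by any thread simply lands in the freeing thread's cache, so no notification queue is needed, and the lock is taken only when a whole batch moves.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <new>
#include <vector>

struct FreeNode { FreeNode* next; };

constexpr std::size_t kBlockSize = 64;          // hypothetical bucket size
constexpr std::size_t kSlabSize  = 64 * 1024;   // hypothetical slab size
constexpr std::size_t kBatch     = 32;          // blocks moved per locked call
constexpr std::size_t kMaxCached = 2 * kBatch;  // per-thread cache limit

// Shared bucket: every call takes the lock, but moves a whole batch.
class CentralBucket {
    std::mutex mutex_;
    FreeNode* free_list_ = nullptr;
    std::vector<std::unique_ptr<char[]>> slabs_;

    void refill_locked() {
        slabs_.push_back(std::make_unique<char[]>(kSlabSize));
        char* slab = slabs_.back().get();
        for (std::size_t off = 0; off + kBlockSize <= kSlabSize; off += kBlockSize)
            free_list_ = new (slab + off) FreeNode{free_list_};
    }

public:
    // Hands out up to max_count blocks as a list; `taken` receives the count.
    FreeNode* take_batch(std::size_t max_count, std::size_t& taken) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!free_list_) refill_locked();  // may throw before anything is detached
        FreeNode* batch = nullptr;
        for (taken = 0; taken < max_count && free_list_; ++taken) {
            FreeNode* node = free_list_;
            free_list_ = node->next;
            node->next = batch;
            batch = node;
        }
        return batch;
    }
    void give_back(FreeNode* batch) {      // accepts a null-terminated list
        std::lock_guard<std::mutex> lock(mutex_);
        while (batch) {
            FreeNode* node = batch;
            batch = node->next;
            node->next = free_list_;
            free_list_ = node;
        }
    }
};

inline CentralBucket& central() { static CentralBucket bucket; return bucket; }

// Per-thread cache: the common path touches no lock and no shared data.
class ThreadCache {
    FreeNode* head_ = nullptr;
    std::size_t count_ = 0;

public:
    ~ThreadCache() { if (head_) central().give_back(head_); }  // on thread exit

    void* allocate() {
        if (!head_) head_ = central().take_batch(kBatch, count_);
        FreeNode* node = head_;
        head_ = node->next;
        --count_;
        return node;
    }
    void deallocate(void* p) {             // p may come from any thread
        head_ = new (p) FreeNode{head_};
        if (++count_ <= kMaxCached) return;
        FreeNode* batch = head_;           // detach the first kBatch blocks
        FreeNode* last = batch;
        for (std::size_t i = 1; i < kBatch; ++i) last = last->next;
        head_ = last->next;
        last->next = nullptr;
        count_ -= kBatch;
        central().give_back(batch);        // one locked call per kBatch frees
    }
};

inline ThreadCache& local_cache() { thread_local ThreadCache cache; return cache; }
inline void* cached_alloc() { return local_cache().allocate(); }
inline void cached_free(void* p) { local_cache().deallocate(p); }
```

With these numbers the lock is taken roughly once per 32 operations instead of once per operation. The cost is that up to `kMaxCached` blocks per thread can sit idle in a cache.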

Thanks a lot.

P.S.:

Indeed, Google’s tcmalloc is great. I believe it’s implemented much like what I did (at least the fixed-size part).

But, to be pedantic, there’s one matter where my heap is superior. According to the docs, tcmalloc imposes an overhead of roughly 1% (asymptotically), whereas my overhead is 0.0061%. It’s 4/64K, to be exact.

🙂

1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 8:30 am

    One thought is to maintain a memory allocator per thread. Pre-assign fairly chunky blocks of memory to each allocator from a global memory pool. Design your algorithm to assign the chunky blocks from adjacent memory addresses (more on that later).

    When the allocator for a given thread is low on memory, it requests more memory from the global memory pool. This operation requires a lock, but should occur far less frequently than in your current case. When the allocator for a given thread frees its last byte, return all memory for that allocator to the global memory pool (on the assumption that the thread has terminated).
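The scheme described so far could be sketched roughly as follows. Everything here is an assumption made for illustration: the names `GlobalPool`, `ThreadArena`, `global_pool` and `thread_arena`, the chunk size and count, and in particular the choice of a simple bump allocator inside each chunk, which does not reuse individual blocks and only returns memory once the last allocation is freed. Blocks must be freed by the thread that allocated them.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical global pool: one contiguous region, handed out chunk by chunk
// so that each thread's memory comes from adjacent addresses.
class GlobalPool {
public:
    static constexpr std::size_t kChunkSize  = 64 * 1024;
    static constexpr std::size_t kChunkCount = 64;

    char* acquire_chunk() {                  // the only locked allocation path
        std::lock_guard<std::mutex> lock(mutex_);
        if (returned_) {                     // prefer chunks that were given back
            FreeChunk* chunk = returned_;
            returned_ = chunk->next;
            return reinterpret_cast<char*>(chunk);
        }
        if (next_unused_ == kChunkCount) throw std::bad_alloc();
        return region_.get() + kChunkSize * next_unused_++;
    }
    void release_chunk(char* chunk) {
        std::lock_guard<std::mutex> lock(mutex_);
        returned_ = new (chunk) FreeChunk{returned_};
    }

private:
    struct FreeChunk { FreeChunk* next; };
    std::mutex mutex_;
    std::unique_ptr<char[]> region_{new char[kChunkSize * kChunkCount]};
    std::size_t next_unused_ = 0;
    FreeChunk* returned_ = nullptr;
};

// Hypothetical per-thread allocator: lock-free bump allocation inside chunks.
class ThreadArena {
    GlobalPool& pool_;
    std::vector<char*> chunks_;                    // chunks owned by this thread
    std::size_t used_ = GlobalPool::kChunkSize;    // "full", forces a first chunk
    std::size_t live_ = 0;                         // outstanding allocations

    void release_all() {
        for (char* chunk : chunks_) pool_.release_chunk(chunk);
        chunks_.clear();
        used_ = GlobalPool::kChunkSize;
    }

public:
    explicit ThreadArena(GlobalPool& pool) : pool_(pool) {}
    ThreadArena(const ThreadArena&) = delete;
    ThreadArena& operator=(const ThreadArena&) = delete;
    ~ThreadArena() { release_all(); }              // thread is terminating

    void* allocate(std::size_t size) {
        constexpr std::size_t kAlign = alignof(std::max_align_t);
        if (size > GlobalPool::kChunkSize) throw std::bad_alloc();
        if (size == 0) size = 1;
        size = (size + kAlign - 1) / kAlign * kAlign;
        if (size > GlobalPool::kChunkSize - used_) {   // low on memory
            chunks_.reserve(chunks_.size() + 1);       // so push_back cannot throw
            chunks_.push_back(pool_.acquire_chunk());  // takes the global lock
            used_ = 0;
        }
        void* p = chunks_.back() + used_;
        used_ += size;
        ++live_;
        return p;
    }
    void deallocate(void*) {
        if (--live_ == 0) release_all();   // last byte freed: return everything
    }
    std::size_t chunk_count() const { return chunks_.size(); }
};

inline GlobalPool& global_pool() { static GlobalPool pool; return pool; }
inline ThreadArena& thread_arena() {
    thread_local ThreadArena arena(global_pool());
    return arena;
}
```

A real per-thread allocator would normally manage free blocks inside its chunks (for example with fixed-size free lists) rather than only bumping a cursor. The sketch only aims to show where the lock sits: in `acquire_chunk` and `release_chunk`, not on every allocation.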

    This approach will tend to exhaust memory earlier than your current approach (memory can be reserved for one thread that never needs it). The extent to which that is an issue depends on the thread creation / lifetime / destruction profile of your app(s). You can mitigate that at the expense of additional complexity, e.g. by introducing a signal, raised when the allocator for a given thread is out of memory and the global pool is exhausted, that the other allocators can respond to by freeing some memory.

    An advantage of this scheme is that it will tend to eliminate false sharing, as memory for a given thread will tend to be allocated from contiguous address ranges.
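The false-sharing point can be illustrated with plain address arithmetic. This is only a sketch: the 64-byte cache line, the 16-byte block, the 2 KB chunk and the helper names `share_cache_line`, `shared_block` and `per_thread_block` are assumptions, not values taken from the question or the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;    // assumption: common x86-64 line size
constexpr std::size_t kBlock     = 16;    // hypothetical small block size
constexpr std::size_t kChunk     = 2048;  // per-thread chunk, multiple of kCacheLine

alignas(kCacheLine) static char g_region[2 * kChunk];  // stand-in for heap memory

inline bool share_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}

// Shared heap: consecutive requests, possibly from different threads,
// receive adjacent blocks.
inline char* shared_block(std::size_t request_index) {
    return g_region + request_index * kBlock;
}

// Per-thread chunks: thread t only ever receives blocks from its own chunk.
inline char* per_thread_block(std::size_t thread, std::size_t index) {
    return g_region + thread * kChunk + index * kBlock;
}
```

With a shared heap, `shared_block(0)` and `shared_block(1)` may belong to different threads and still sit on the same cache line. With cache-line-aligned per-thread chunks, even the last block of one thread's chunk and the first block of the next thread's chunk fall on different lines.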

    On a side note, if you have not already read it, I suggest IBM’s Inside Memory Management article for anyone implementing their own memory management.

    UPDATE

    If the goal is to have very fast memory allocation optimized for a multi-threaded environment (as opposed to learning how to do it yourself), have a look at alternate memory allocators. If the goal is learning, perhaps check out their source code.

    • Hoard
    • tcmalloc (thanks Voo)