

The Archive Base Latest Questions

Editorial Team
Asked: May 13, 2026


Disclaimer: please do not bother reading this long post unless you are an embedded programmer, or a kernel developer, or a high performance system software programmer. It is about memory allocators and might not be of any interest to you.

I am building a library for high-volume, high-performance same-machine IPC, using OS-specific pipes (named pipes on Windows and UNIX domain sockets on UNIX). The library receives messages and injects them into a processing engine; this engine could be in the same process in C++, or something like Erlang or Lua, or perhaps moved off to a central machine.

The messages at the moment come from within a web server and are part of the resolution pipeline of an HTTP request, so the system must scale with the web server; that is, I cannot write an inefficient system, because the web server's worker threads depend on the speed with which they can offload these messages. If a web server is capable of handling 10k requests for static pages, it should not be massively hampered by this IPC mechanism.

The data comes in as typed messages, and I believe each type will have its own size statistics, which could be leveraged. Doug Lea's allocator maintains linked lists of bins by size, if I remember correctly; perhaps that is the way I should go, since it is the standard for a general allocator.

I want zero-copy semantics, so until a message is done processing it should not be copied around. On a side note: this perhaps takes Google's Protocol Buffers out of the picture (I can't be sure at the moment), as its wire format is different from the in-memory object format. I will investigate further; my current implementation uses my own IPC format. Perhaps someone has opinions on how Protocol Buffers does things: does it use zero-copy internally?

I want to use this allocator in all areas of the application. My study of network-stack software is urging me towards a processor-page-aligned pool allocator; my application certainly has the same access semantics. I would copy all messages into a page and only recycle a page once all the element references allocated from it are inactive.

To slightly formalize and help you understand what I am thinking about consider the following data structure:

typedef struct _page
{
   int active_elements;            // page is returned to the empty pool at zero
   int free_bytes;                 // used to sort the pages
   void * next_free_contiguous;    // pointer to where the next allocation may occur
   struct _page * next_page;       // linked-list next-page pointer
} page;

struct pageAllocator
{
   page * ll_empty;       // completely empty pages
   page * ll_active;      // sorted by highest size available; could be a binary tree,
                          // continue reading for the rationale.
};
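To make the design above concrete, here is a minimal, self-contained sketch of how allocation and freeing could work against such a page. The embedded data array, PAGE_SIZE, and the function names are my own illustrative assumptions, not part of the question's design:

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

typedef struct page {
    int active_elements;          // page is recycled at zero
    int free_bytes;               // used to sort the pages
    char *next_free_contiguous;   // bump pointer for the next allocation
    struct page *next_page;       // linked-list next-page pointer
    char data[PAGE_SIZE];         // the pooled storage itself
} page;

static page *page_new(void)
{
    page *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->active_elements = 0;
    p->free_bytes = PAGE_SIZE;
    p->next_free_contiguous = p->data;
    p->next_page = NULL;
    return p;
}

// Bump allocation; returns NULL when this page cannot satisfy the request.
static void *page_alloc(page *p, size_t n)
{
    // word-align the bump pointer (point 3 in the list below)
    size_t align = sizeof(void *);
    size_t off = (size_t)(p->next_free_contiguous - p->data);
    size_t pad = (align - off % align) % align;
    void *out;
    if ((size_t)p->free_bytes < n + pad) return NULL;
    p->next_free_contiguous += pad;
    out = p->next_free_contiguous;
    p->next_free_contiguous += n;
    p->free_bytes -= (int)(n + pad);
    p->active_elements++;
    return out;
}

// Freeing only drops the refcount; space is reclaimed a page at a time.
// Returns 1 when the whole page is idle and can move back to ll_empty.
static int page_free(page *p)
{
    if (--p->active_elements == 0) {
        p->free_bytes = PAGE_SIZE;
        p->next_free_contiguous = p->data;
        return 1;
    }
    return 0;
}
```

Note that page_free never reclaims an individual element; the page is reset in one shot once its reference count hits zero, which matches the recycle-when-all-inactive policy described above.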

Some points are:

  1. The messages won’t linger around, so pages will become free fairly soon.
  2. I need to write bullet-proof code to free the pages.
  3. I do not know if page->next_free_contiguous needs to be word-aligned.
  4. I assume that under pressure the number of pages allocated will be stable.
  5. Should I be returning pages to the OS, or should I just keep them? Maximum pressure would only allocate as many pages as it needs once maximum pressure has been established.
  6. Should I allocate messages from the first element in the size-sorted list (first being the largest), or should I find the best-matching page? In that case I would have to use a binary tree of some sort. Am I re-implementing Doug Lea’s algorithm here?

I want suggestions from people who have personally dealt with such things; alternate allocator suggestions would be welcome. Off-the-shelf libraries would be great (they have to be ANSI C or a C-preprocessor meta-code implementation). APR is a popular library with a hierarchical pool-based strategy, but I am not dealing with sessions at this low level, so I don’t need a hierarchy.

I don’t think I am pre-optimizing, because I am convinced I need it. My current implementation of this system jumps to 30% CPU during web-server CPU saturation, and I know this is because of my naive way of programming the application using new and a lot of copy construction in C++. I will of course benchmark my application before I proceed (I am too busy at the moment to rebuild a test environment), but any discussion here will be useful regardless of the outcome.

— edit 1 —

This implementation was inspired by some comments from Markr; check the answer below for some discussion. A bit more thinking led to the following.

OK, I am thinking about a circular buffer now (say 1 MB^x), composed of equal-sized elements (perhaps some multiple of 128 bytes). Allocations happen incrementally towards the end, with an allocate-head (A) marking where allocations may occur from, and a free-marker (F) marking contiguous free regions (0) being returned to the pool. Consider the following ASCII illustration:

|000000F101110001111A00000000|

0 = free, 1 = occupied.

Some specifics are:

  • The free-marker delimits contiguous free regions. In the illustration, F will only move two forward when the immediately following chunk is returned.
  • Once the allocation reaches the end, or there isn’t enough space at the end to perform a contiguous allocation, the allocate-head moves to the beginning; most of the buffer should be free by this point.
  • Once a message is moved to its destination, its chunks are immediately returned to the buffer.
  • On same-machine delivery the messages will be delivered in short order (some blocking), or immediately if it just requires a thread-free function call.
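The circular-buffer idea can be sketched as follows. The chunk count, the one-byte-per-chunk occupancy map, and the function names are illustrative assumptions (a real implementation would pack this metadata far more tightly):

```c
#include <string.h>

#define CHUNKS 32   // e.g. a 1 MB buffer of 128-byte chunks in practice

typedef struct ring {
    unsigned char used[CHUNKS];   // 1 = occupied, 0 = free
    int alloc_head;               // A: where the next allocation may occur
    int free_marker;              // F: start of the contiguous free region
} ring;

// Allocate n contiguous chunks at A, wrapping to 0 if the tail cannot
// hold them.  Returns the first chunk index, or -1 if it would collide
// with still-occupied chunks.
static int ring_alloc(ring *r, int n)
{
    int start = r->alloc_head;
    int i;
    if (start + n > CHUNKS)       // not enough space at the end: wrap
        start = 0;
    for (i = 0; i < n; i++)
        if (r->used[start + i]) return -1;
    memset(r->used + start, 1, (size_t)n);
    r->alloc_head = start + n;
    return start;
}

// Return chunks immediately; F only advances while the chunk directly
// after it is free, so it always delimits a contiguous free region.
static void ring_return(ring *r, int start, int n)
{
    memset(r->used + start, 0, (size_t)n);
    while (r->free_marker != r->alloc_head && !r->used[r->free_marker])
        r->free_marker = (r->free_marker + 1) % CHUNKS;
}
```

Note how an out-of-order return clears its chunks at once, but F stays put until the chunks immediately ahead of it come back, exactly as in the ASCII illustration.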

Comments would be nice, and I would still prefer a self-contained portable library if anyone knows of one that does something similar. The GLib slice allocator looked nice, but I believe the above implementation is simpler and probably faster, and the active regions of the buffer will always have good locality.



1 Answer

  1. Editorial Team
     Answered on May 13, 2026 at 8:54 am

    After reading others’ comments, it sounds like using a fixed pool of objects and allocating ownership of them may be a better idea than trying to mess with allocators.

    Presumably you have a large number of threads, so any allocator would need to do quite a bit of locking; a large number of small allocs/frees sounds undesirable.

    If you think you’ve got enough memory to keep reasonable sized pools per thread, you can just allocate a fixed pool per thread which contains enough IPC objects to handle all the requests that thread will have outstanding at one time, and allocate them per-thread from this pool with no locking whatsoever, marking them for reuse once those requests are complete.

    Once you move the other processes off-box, the problem becomes completely different as zero-copy is no longer an option (the data clearly need to be copied at least once to reach another machine) so I think you’ll have other issues.


© 2021 The Archive Base. All Rights Reserved
