I’m building an application that will have dynamically allocated objects of type A, each with a dynamically allocated member (v), similar to the class below:
class A {
int a;
int b;
int* v;
};
where:
- The memory for v will be allocated in the constructor.
- v will be allocated once when an object of type A is created and will never need to be resized.
- The size of v will vary across all instances of A.
The application will potentially have a huge number of such objects and will mostly need to stream a large number of them through the CPU, performing only very simple computations on the member variables.
- Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
- What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
- If such fragmentation is a performance issue, are there any techniques that would allow A and v to be allocated in a contiguous region of memory?
- Or are there any techniques to aid memory access, such as a pre-fetching scheme? For example, get an object of type A and operate on the other member variables whilst pre-fetching v.
- If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
The target platforms are standard desktop machines with x86/AMD64 processors running Windows or Linux, and the code will be compiled with either GCC or MSVC.
If you have a good reason to care about performance…
If they are both allocated with ‘new’, then it is likely that they will be near one another. However, the current state of the heap can drastically affect this outcome; it depends significantly on what you’ve been doing with memory. If you just allocate a thousand of these things one after another, then the later ones will almost certainly be “nearly contiguous”.
If the A instance is on the stack, it is highly unlikely that its ‘v’ will be nearby.
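One way to check this on a particular machine is simply to measure how far each object sits from its buffer. The following is a rough sketch only: the constructor taking a length, and the helpers distance_to_v and measure_gaps, are hypothetical names introduced for illustration, and the numbers you get depend entirely on the allocator and on what the heap looked like beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// The class from the question, with an assumed constructor that allocates v.
class A {
public:
    explicit A(std::size_t n) : a(0), b(0), v(new int[n]()) { assert(n > 0); }
    ~A() { delete[] v; }
    A(const A&) = delete;
    A& operator=(const A&) = delete;

    int a;
    int b;
    int* v;
};

// Byte distance between an object and its v buffer.
// Addresses are compared as integers, since the two blocks are unrelated.
std::uintptr_t distance_to_v(const A& obj) {
    const std::uintptr_t self = reinterpret_cast<std::uintptr_t>(&obj);
    const std::uintptr_t buf = reinterpret_cast<std::uintptr_t>(obj.v);
    return self > buf ? self - buf : buf - self;
}

// Allocate `count` objects back to back and record each object's gap to its v.
std::vector<std::uintptr_t> measure_gaps(std::size_t count, std::size_t n) {
    std::vector<std::unique_ptr<A>> objs;
    std::vector<std::uintptr_t> gaps;
    objs.reserve(count);
    gaps.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        objs.push_back(std::unique_ptr<A>(new A(n)));
        gaps.push_back(distance_to_v(*objs.back()));
    }
    return gaps;
}
```

Printing the values returned by measure_gaps(1000, 16) for a fresh process, and again after the program has been allocating and freeing for a while, gives a feel for how much the heap state matters. Calling distance_to_v on an A declared as a local variable shows the stack case.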
Allocate space for both, then placement new them into that space. It’s dirty, but it should typically work:
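A minimal sketch of what that could look like, assuming A's members are made public and that objects are always created and destroyed through the two helper functions. The names create_a and destroy_a are hypothetical, invented here for illustration; this is one possible shape of the technique, not a definitive implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Same members as the class in the question, made public for brevity.
struct A {
    int a;
    int b;
    int* v;  // points into the same heap block as this A
};

// Hypothetical helper: a single allocation holds the A header followed by n ints.
A* create_a(std::size_t n) {
    static_assert(sizeof(A) % alignof(int) == 0,
                  "the int array must start at a suitably aligned offset");
    void* raw = ::operator new(sizeof(A) + n * sizeof(int));
    A* obj = new (raw) A{0, 0, nullptr};  // placement new for the header
    obj->v = reinterpret_cast<int*>(static_cast<unsigned char*>(raw) + sizeof(A));
    for (std::size_t i = 0; i < n; ++i) {
        new (obj->v + i) int(0);  // construct each element in place, zeroed
    }
    return obj;
}

// Hypothetical helper: must be used instead of a plain `delete`.
void destroy_a(A* obj) {
    assert(obj != nullptr);
    obj->~A();               // trivial here, shown for completeness
    ::operator delete(obj);  // releases header and array together
}
```

Usage would be something like A* p = create_a(100); ... destroy_a(p);. Such an object cannot live on the stack or be copied by value without extra care, which is part of why this is “dirty”.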
Prefetching is compiler and platform specific, but many compilers have intrinsics available to do it. Mind you, it won’t help much if you’re going to access that data right away; for prefetching to be of any value you often need to issue it hundreds of cycles before you want the data. That said, it can be a huge boost to speed. The intrinsic would look something like
__pf(my_a->v);
Maybe. If the fixed-size buffer is usually close to the size you’ll need, then it could be a huge boost in speed. It will always be faster to access one A instance in this way, but if the buffer is unnecessarily gigantic and largely unused, you’ll lose the opportunity for more objects to fit into the cache. I.e. it’s better to have more smaller objects in the cache than it is to have a lot of unused data filling the cache up.
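Coming back to the prefetch placeholder above: on the compilers mentioned in the question, the real intrinsics are __builtin_prefetch (GCC) and _mm_prefetch (MSVC, from xmmintrin.h). Below is a sketch of a streaming loop that requests v for an object a few iterations ahead. The length member n, the wrapper prefetch, the function sum_all and the lookahead of 8 are all assumptions made for illustration; the right distance has to be found by measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(_MSC_VER)
#include <xmmintrin.h>
#endif

// As in the question, plus an assumed length field so the loop knows v's size.
struct A {
    int a;
    int b;
    int* v;
    std::size_t n;  // assumption: number of ints in v
};

// Thin wrapper over the compiler-specific prefetch intrinsic.
inline void prefetch(const void* p) {
#if defined(_MSC_VER)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#else
    __builtin_prefetch(p);
#endif
}

// Sums every member of every object, prefetching v for a later object first.
long long sum_all(const std::vector<A*>& objs, std::size_t lookahead = 8) {
    long long total = 0;
    for (std::size_t i = 0; i < objs.size(); ++i) {
        if (i + lookahead < objs.size()) {
            // Reading ->v touches the later A header itself, then requests
            // the first cache line of its buffer.
            prefetch(objs[i + lookahead]->v);
        }
        const A* cur = objs[i];
        assert(cur != nullptr);
        total += cur->a + cur->b;
        for (std::size_t k = 0; k < cur->n; ++k) {
            total += cur->v[k];
        }
    }
    return total;
}
```

A prefetch is only a hint and only covers one cache line per call, so for long v buffers the hardware prefetcher may already be doing most of the work once the first line has arrived.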
The specifics depend on what your design and performance goals are. For an interesting discussion of this, built around a “real-world” problem on a specific bit of hardware with a specific compiler, see The Pitfalls of Object Oriented Programming (a PDF).
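To put rough numbers on the fixed-size buffer trade-off described earlier, here is a small sketch. AFixed, unused_bytes and cache_lines_per_object are hypothetical names, and the 64-byte cache line is an assumption that is typical for current x86 parts but should be checked for the actual target.

```cpp
#include <cstddef>

// Fixed-capacity variant of A: the buffer lives inside the object itself.
template <std::size_t MaxLength>
struct AFixed {
    int a;
    int b;
    int v[MaxLength];
};

constexpr std::size_t kCacheLineBytes = 64;  // assumption, typical for x86

// Bytes of buffer carried around but never used when only `used` ints matter.
template <std::size_t MaxLength>
constexpr std::size_t unused_bytes(std::size_t used) {
    return (MaxLength - used) * sizeof(int);
}

// Minimum number of cache lines one object occupies (rounded up).
template <std::size_t MaxLength>
constexpr std::size_t cache_lines_per_object() {
    return (sizeof(AFixed<MaxLength>) + kCacheLineBytes - 1) / kCacheLineBytes;
}
```

With 4-byte ints, AFixed<14> is exactly 64 bytes, so one object fits in one cache line, whereas AFixed<1022> spans 64 lines even if only a handful of its elements are ever touched.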