Profiling my code, I see a lot of cache misses and would like to know whether there is a way to improve the situation. Optimization is not really needed; I’m more curious about whether there are general approaches to this problem (this is a follow-up question).
// class to compute stuff
class A {
public:
    double compute();
    ...
    // depends on other objects
    std::vector<A*> dependencies;
};
I have a container class that stores pointers to all created objects of class A. I do not store copies as I want to have shared access. Before I was using shared_ptr, but as single As are meaningless without the container, raw pointers are fine.
class Container {
public:
    ...
    void compute_all();
    std::vector<A*> objects;
    ...
};
The vector objects is insertion-sorted so that the full evaluation can be done by simply iterating over it and calling compute() on each element; at that point, all dependencies in A are resolved.
With a_i objects of class A, the evaluation might look like this:
a_1 => a_2 => a_3 --> a_2 --> a_1 => a_4 => ....
where => denotes iteration in Container and --> denotes iteration over A::dependencies.
Moreover, the Container class is created only once and compute_all() is called many times, so rearranging the whole structure after creation is an option and wouldn’t harm efficiency much.
Now to the observations/questions:
- Obviously, iterating over Container::objects is cache efficient, but accessing the pointees is definitely not.
- Moreover, each object of type A has to iterate over A::dependencies, which again can produce cache misses.
Would it help to create a separate vector<A*> from all needed objects in evaluation order, such that the dependencies in A are inserted as copies?
Something like this:
a_1 => a_2 => a_3 => a_2_c => a_1_c => a_4 => ....
where a_i_c are copies from a_i.
Thanks for your help and sorry if this question is confusing, but I find it rather difficult to extrapolate from simple examples to large applications.
Unfortunately, I’m not sure if I’m understanding your question correctly, but I’ll try to answer.
Cache misses occur when the processor needs data that is not already in its cache, which happens often when the data it touches is scattered all over memory.
One very common way of increasing cache hits is organizing your data so that everything that is accessed sequentially is in the same region of memory. Judging by your explanation, I think this is most likely your problem; your A objects are scattered all over the place.
If you’re just calling regular new every single time you need to allocate an A, you’ll probably end up with all of your A objects being scattered.
You can create a custom allocator for objects that will be created many times and accessed sequentially. This custom allocator could allocate a large number of objects and hand them out as requested. This may be similar to what you meant by reordering your data.
It can take a bit of work to implement this, however, because you have to consider cases such as what happens when it runs out of objects, how it knows which objects have been handed out, and so on.
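As a rough illustration of the allocator idea, here is a minimal sketch of a pool that hands out A objects from contiguous blocks. APool, block_size, and the value member of A are names I made up for this example, and compute() is a placeholder body, not your actual computation. The sketch sidesteps the hard cases mentioned above: it never gives objects back individually, and when a block runs out it simply allocates another one.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the question's class A.
struct A {
    double value = 0.0;            // assumed payload, not in the original
    std::vector<A*> dependencies;  // as in the question

    // Placeholder computation: own value plus the values of all dependencies.
    double compute() const {
        double sum = value;
        for (const A* dep : dependencies) sum += dep->value;
        return sum;
    }
};

// Hypothetical pool: objects created one after another sit next to each
// other in memory. Objects are only released when the pool is destroyed.
class APool {
public:
    // A block size of 0 is bumped to 1 so create() always has room.
    explicit APool(std::size_t block_size = 1024)
        : block_size_(block_size ? block_size : 1) {}

    A* create() {
        // Start a new contiguous block when the current one is full.
        if (blocks_.empty() || used_in_last_ == block_size_) {
            blocks_.push_back(std::make_unique<A[]>(block_size_));
            used_in_last_ = 0;
        }
        // Existing blocks are never moved, so returned pointers stay valid.
        return &blocks_.back()[used_in_last_++];
    }

private:
    std::size_t block_size_;
    std::size_t used_in_last_ = 0;
    std::vector<std::unique_ptr<A[]>> blocks_;
};
```

With something like this, Container would call pool.create() wherever it currently calls new A, and if objects are created roughly in evaluation order, iterating over Container::objects walks through memory mostly front to back.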
Another method involves caching operations that work on sequential data, but aren’t performed sequentially. I think this is what you meant by having a separate vector.
However, it’s important to understand that your CPU doesn’t just keep one section of memory in cache at a time. It keeps multiple sections of memory cached.
If you’re jumping back and forth between operations on data in one section and operations on data in another section, this most likely will not cause many cache misses; your CPU can and should keep both sections cached at the same time.
If you’re jumping between operations on 50 different sets of data, you’ll probably encounter many cache misses. In this scenario, caching operations would be beneficial.
In your case, I don’t think caching operations will give you much benefit. Ensuring that all of your A objects reside in the same section of memory, however, probably will.
Another thing to consider is threading, but this can get pretty complicated. If your thread is doing a lot of context switches, you may encounter a lot of cache misses.
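Coming back to the main suggestion of keeping all A objects in one region of memory: since you say the structure can be rearranged once after creation, one option is to store the objects by value in evaluation order and refer to dependencies by index instead of by pointer. The following is only a sketch under that assumption; Node, FlatContainer, input, and result are hypothetical names, and the summation stands in for whatever compute() really does.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical by-value variant of A: dependencies are indices into the
// container rather than pointers to separately allocated objects.
struct Node {
    double input = 0.0;   // assumed payload, not in the original
    double result = 0.0;  // filled in by compute_all()
    std::vector<std::size_t> dependencies;  // indices of earlier nodes
};

struct FlatContainer {
    std::vector<Node> objects;  // contiguous, stored in evaluation order

    // Appends a node and returns its index, to be used as a dependency later.
    std::size_t add(double input, std::vector<std::size_t> deps = {}) {
        Node n;
        n.input = input;
        n.dependencies = std::move(deps);
        objects.push_back(std::move(n));
        return objects.size() - 1;
    }

    // One forward pass; every dependency has a smaller index, so its
    // result is already available when it is read.
    void compute_all() {
        for (Node& n : objects) {
            double sum = n.input;
            for (std::size_t d : n.dependencies) sum += objects[d].result;
            n.result = sum;
        }
    }
};
```

Note that each dependencies vector still owns a separate heap allocation, so this only removes the scattering of the objects themselves. Whether it helps in practice is something only your profiler can tell you.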