I’m writing a function where I need a significant amount of heap memory. Is it possible to tell the compiler that those data will be accessed frequently within a specific for loop, so as to improve performance (through compile options or similar)?
The reason I cannot use the stack is that the number of elements I need to store is big, and I get segmentation fault if I try to do it.
Right now the code is working but I think it could be faster.
UPDATE:
I’m doing something like this
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
    for(uint j = i+1; j < node_vec.size(); j++) {
        // some computation, basic math, store the result in variable x
        if( x > threshold ) {
            vec[i].insert(j);
            vec[j].insert(i);
        }
    }
some details:
– I tried hash_set and saw little improvement, besides the fact that hash_set is not available on all the machines I use for simulation purposes
– I tried to allocate vec on the stack using arrays but, as I said, I might get segmentation fault if the number of elements is too big
If node_vec.size() is, say, equal to k, where k is of the order of a few thousands, I expect vec to be 4 or 5 times bigger than node_vec. With this order of magnitude the code appears to be slow, considering the fact that I have to run it many times. Of course, I am using multithreading to parallelize these calls, but I can’t get the function per se to run much faster than what I’m seeing right now.
Would it be possible, for example, to have vec allocated in the cache memory for fast data retrieval, or something similar?
UPDATE
That still doesn’t show much, because we cannot know how often the condition x > threshold will be true. If x > threshold is very frequently true, then the std::set might be the bottleneck, because it has to do a dynamic memory allocation for every uint you insert.
Also we don’t know what "some computation" actually means/does/is. If it does much, or does it in the wrong way, that could be the bottleneck.
And we don’t know how you need to access the result.
Anyway, on a hunch:
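A minimal sketch of what such a hunch could look like, assuming the idea is to drop the per-node std::set in favor of one flat std::vector of index pairs. This is not the original code: Edge, collect_pairs and score are hypothetical names, and score merely stands in for the "some computation" from the question.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned int uint;
typedef std::pair<uint, uint> Edge;  // hypothetical name: (i, j) with i < j

// Hypothetical stand-in for the "some computation" in the question.
inline double score(double a, double b) { return a * b; }

// Collect every pair (i, j), i < j, whose score exceeds the threshold.
// push_back on a vector of plain pairs only allocates when the capacity
// runs out, instead of once per inserted element as std::set does.
std::vector<Edge> collect_pairs(const std::vector<double>& node_vec,
                                double threshold) {
    std::vector<Edge> edges;
    const uint n = static_cast<uint>(node_vec.size());
    // Assumption taken from the question: roughly 4-5 results per node.
    edges.reserve(static_cast<std::size_t>(5) * n);
    for (uint i = 0; i < n; ++i)
        for (uint j = i + 1; j < n; ++j)
            if (score(node_vec[i], node_vec[j]) > threshold)
                edges.push_back(Edge(i, j));
    return edges;  // ordered by (i, j) because of the loop order
}
```

Each pair is stored once, so a consumer that needs both directions has to look at both elements of the pair.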
If you can use the result in that form, you’re done. Otherwise you could do some post-processing. Just don’t copy it into a std::set again (obviously). Try to stick to std::vector<POD>. E.g. you could build an index into the vectors.
P.S.: I’m almost sure your loop is not memory-bound. Can’t be sure though… if the "nodes" you’re not showing us are really big it might still be.
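As for building an index into the vectors, here is one possible sketch. It assumes the results live in a single std::vector of (i, j) pairs sorted by i; Edge and build_index are hypothetical names, and the layout is the usual offsets-array trick rather than anything specific to the question.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<unsigned, unsigned> Edge;  // hypothetical: (i, j) pair

// Build an offsets array: the edges whose first element is i occupy
// positions [offsets[i], offsets[i + 1]) of the edge vector.
// Assumes `edges` is sorted by first element and every first < node_count.
std::vector<std::size_t> build_index(const std::vector<Edge>& edges,
                                     std::size_t node_count) {
    std::vector<std::size_t> offsets(node_count + 1, 0);
    // Count how many edges start at each node (stored one slot to the right).
    for (std::size_t k = 0; k < edges.size(); ++k)
        ++offsets[edges[k].first + 1];
    // A running sum turns the counts into start positions.
    for (std::size_t i = 0; i < node_count; ++i)
        offsets[i + 1] += offsets[i];
    return offsets;
}
```

Looking up the entries for node i is then a walk over a contiguous range, with no per-node container and no extra allocation.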
Original answer:
There is no easy I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len)-kind-of solution.
You can do some things though.
Make sure the compiler can "see" the implementation of every function that’s called inside critical loops. What is necessary for the compiler to be able to "see" the implementation depends on the compiler. There is one way to be sure though: define all relevant functions in the same translation unit before the loop, and declare them as inline.
This also means you should not by any means call "external" functions in those critical loops. And by "external" functions I mean things like system calls, runtime-library stuff or stuff implemented in a DLL/SO. Also don’t call virtual functions and don’t use function pointers. And of course don’t allocate or free memory (inside the critical loops).
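To illustrate the two points above, a small sketch: the helper is defined as inline in the same translation unit before the loop, and the loop itself makes no allocation, virtual call, function-pointer call or library call. Point, squared_distance and min_squared_distance are hypothetical names chosen for the example.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };  // hypothetical node type

// Defined before the loop, in the same translation unit, and marked inline,
// so the compiler can see the body and is free to inline it.
inline double squared_distance(const Point& a, const Point& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// The critical loop: only arithmetic and the visible inline helper.
// Returns the largest double if there are fewer than two points.
double min_squared_distance(const std::vector<Point>& pts) {
    double best = std::numeric_limits<double>::max();
    const std::size_t n = pts.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            const double d = squared_distance(pts[i], pts[j]);
            if (d < best) best = d;
        }
    return best;
}
```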
Make sure you use an optimal algorithm. Linear optimization is moot if the complexity of the algorithm is higher than necessary.
Use the smallest possible types. E.g. don’t use int if signed char will do the job. That’s something I wouldn’t normally recommend, but when processing a large chunk of memory it can increase performance quite a lot, especially in very tight loops.
If you’re just copying or filling memory, use memcpy or memset. Disable the intrinsic version of those two functions if the chunks are larger than about 50 to 100 bytes.
Make sure you access the data in a cache-friendly manner. The optimum is "streaming", i.e. accessing the memory with ascending or descending addresses. You can "jump" ahead some bytes at a time, but don’t jump too far. The worst is random access to a big block of memory. E.g. if you have to work on a 2-dimensional matrix (like a bitmap image) where p[0] to p[1] is a step "to the right" (x + 1), make sure the inner loop increments x and the outer increments y. If you do it the other way around, performance will be much, much worse.
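The bitmap case can be sketched as follows, assuming a row-major layout in which pixel (x, y) lives at p[y * width + x]. The function name and the choice of unsigned char for the pixels are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Sum all pixels of a width x height image stored row by row.
// The inner loop walks x, so consecutive iterations touch consecutive
// addresses and the memory is streamed through the cache.
long sum_row_major(const std::vector<unsigned char>& p,
                   std::size_t width, std::size_t height) {
    long total = 0;
    for (std::size_t y = 0; y < height; ++y)      // outer loop: rows
        for (std::size_t x = 0; x < width; ++x)   // inner loop: neighbours
            total += p[y * width + x];
    return total;
}
```

Swapping the two loops gives the same sum, but every step of the inner loop would then jump width bytes ahead instead of one.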
If your pointers are alias-free, you can tell the compiler (how that’s done depends on the compiler). If you don’t know what alias-free means I recommend searching the net and your compiler’s documentation, since an explanation would be beyond the scope.
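As one example of how that can look, here is a sketch using __restrict. That keyword is a compiler extension accepted by GCC, Clang and MSVC, not standard C++, so check your own compiler’s documentation; scale_add is a hypothetical function.

```cpp
#include <cstddef>

// __restrict promises the compiler that dst and src never overlap,
// so it does not have to reload src[i] after every store to dst[i].
// Calling this with overlapping ranges breaks that promise.
void scale_add(float* __restrict dst, const float* __restrict src,
               float factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += factor * src[i];
}
```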
Use intrinsic SIMD instructions if appropriate.
Use explicit prefetch instructions if you know which memory locations will be needed in the near future.
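A sketch of explicit prefetching, assuming GCC or Clang, where __builtin_prefetch is available (on other compilers the macro below expands to nothing). PREFETCH_READ and gather_sum are hypothetical names, and the prefetch distance of 8 is a guess that would have to be tuned by measuring.

```cpp
#include <cstddef>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
// Arguments: address, 0 = read access, 3 = keep in all cache levels.
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)
#endif

// Sum values[indices[k]]. The access pattern is random, but the index
// array tells us in advance which locations will be needed soon.
double gather_sum(const std::vector<double>& values,
                  const std::vector<std::size_t>& indices) {
    const std::size_t ahead = 8;  // assumption: tune by measurement
    double total = 0.0;
    for (std::size_t k = 0; k < indices.size(); ++k) {
        if (k + ahead < indices.size())
            PREFETCH_READ(&values[indices[k + ahead]]);
        total += values[indices[k]];
    }
    return total;
}
```

A prefetch is only a hint and never changes the result, so whether it helps at all is something only a benchmark on the target machine can tell.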