As the title suggests, I had problems with a program of mine where I used a std::list as a stack and also iterated over all of its elements. The program was taking way too long when the lists became very big.
Does anyone have a good explanation for this? Is it some stack/cache behavior?
(Solved the problem by changing the lists to std::vector and std::deque (an amazing data structure by the way) and everything suddenly went so much faster)
EDIT: I’m not a fool and I don’t access elements in the middle of the lists. The only thing I did with the lists was to remove/add elements at the end/beginning and to iterate through all elements of the list.
And I always used iterators to iterate over the list.
Lists have terrible (nonexistent) cache locality. Every node is a new memory allocation, and may be anywhere. So every time you follow a pointer from one node to the next, you jump to a new, unrelated, place in memory. And yes, that hurts performance quite a bit. A cache miss may be two orders of magnitude slower than a cache hit. In a vector or deque, pretty much every access will be a cache hit. A vector is one single contiguous block of memory, so iterating over that is as fast as you’re going to get. A deque is several smaller blocks of memory, so it introduces the occasional cache miss, but they’ll still be rare, and iteration will still be very fast as you’re getting mostly cache hits.
A list will be almost all cache misses. And performance will suck.
In practice, a linked list is hardly ever the right choice from a performance point of view.
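If you want to measure this yourself, here is a minimal sketch, assuming the elements are plain ints and that the containers are only ever filled with push_back and then walked with iterators, as in the question. The helper names (make_filled, sum_all, time_traversal_us) are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <list>
#include <vector>

// Build a container holding 0, 1, ..., n-1 using only push_back,
// which std::vector, std::deque and std::list all support.
template <typename Container>
Container make_filled(std::size_t n) {
    Container c;
    for (std::size_t i = 0; i < n; ++i) {
        c.push_back(static_cast<int>(i));
    }
    return c;
}

// Walk the whole container with iterators and add up the elements.
// Returning the sum keeps the compiler from discarding the loop.
template <typename Container>
long long sum_all(const Container& c) {
    long long total = 0;
    for (auto it = c.begin(); it != c.end(); ++it) {
        total += *it;
    }
    return total;
}

// Time a single full traversal, in microseconds (hypothetical helper).
// The computed sum is written to sum_out so the caller can use it.
template <typename Container>
long long time_traversal_us(const Container& c, long long& sum_out) {
    const auto start = std::chrono::steady_clock::now();
    sum_out = sum_all(c);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Call time_traversal_us on a std::list<int>, a std::vector<int> and a std::deque<int> of the same size and compare the numbers. Be aware that a list built in one go like this often gets its nodes handed out next to each other by the allocator, so the gap may look smaller than in a real program, where nodes are allocated and freed over time and end up scattered.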
Edit:
As a comment pointed out, another problem with lists is data dependencies. A modern CPU likes to overlap operations. But it can’t do that if the next instruction depends on the result of this one.
If you’re iterating over a vector, that’s no problem. You can compute the next address to read on the fly, without ever having to check in memory. If you’re reading at address x now, then the next element will be located at address x + sizeof(T), where T is the element type. So there are no dependencies there, and the CPU can start loading the next element, or the one after it, immediately, while still processing an earlier element. That way, the data will be ready for us when we need it, and this further helps mask the cost of accessing data in RAM.
In a list, we need to follow a pointer from node i to node i+1, and until i+1 has been loaded, we don’t even know where to look for i+2. We have a data dependency, so the CPU is forced to read nodes one at a time, and it can’t start reading future nodes ahead of time, because it doesn’t yet know where they are.
If a list hadn’t been all cache misses, this wouldn’t have been a big problem, but since we’re getting a lot of cache misses, these delays are costly.
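To make the dependency visible in code, here is a rough sketch of the two loops side by side. The Node struct is a hypothetical stand-in for a list node (std::list is doubly linked and its node layout is an implementation detail), and link_nodes is just a helper invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical singly linked node, standing in for a std::list node.
struct Node {
    int value;
    Node* next;
};

// Contiguous case: the address of element i is data + i * sizeof(int).
// It depends only on the loop counter, never on a value loaded from memory,
// so loads for later elements can be issued before earlier ones complete.
long long sum_contiguous(const int* data, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Linked case: the address of the next node is stored inside the current
// node, so the load of node->next has to finish before the following
// node can even be requested.
long long sum_linked(const Node* head) {
    long long total = 0;
    for (const Node* node = head; node != nullptr; node = node->next) {
        total += node->value;
    }
    return total;
}

// Chain the nodes in `storage` together in storage order and return the head.
// Note: these nodes sit in one contiguous vector, so this setup shows the
// dependency chain only, not the scattered layout of a real list.
Node* link_nodes(std::vector<Node>& storage) {
    if (storage.empty()) {
        return nullptr;
    }
    for (std::size_t i = 0; i + 1 < storage.size(); ++i) {
        storage[i].next = &storage[i + 1];
    }
    storage.back().next = nullptr;
    return storage.data();
}
```

Both functions do the same arithmetic. The only difference is where the next address comes from: in sum_contiguous it is computed from i, in sum_linked it has to be read out of the node that was just loaded.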