Why can the same assembler operation (mul, for example) take different amounts of time in different parts of a program?
P.S. I’m using C++ and a disassembler.
There are all kinds of reasons why the same kind of operation can have massively varying performance on modern processors.
Data Cache Misses:
If your operation accesses memory, it might hit the cache at one location and generate a cache miss at another. A cache miss can cost on the order of a hundred cycles, while simple operations often execute in a few cycles, so the miss makes the operation much slower.
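A rough way to see this from C++ is to run the same load-and-add loop over an array twice, once in sequential order and once in shuffled order. The sketch below is only an illustration: the names `traverse`, `make_order`, and `TraversalResult` are made up for this example, and the measured times depend heavily on the array size, the CPU, and the compiler flags.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Result of one timed traversal (hypothetical helper type).
// The checksum shows that both runs did the same work.
struct TraversalResult {
    std::uint64_t sum;     // sum of all visited elements
    double milliseconds;   // wall-clock time of the traversal
};

// Visit every element of `data` exactly once, in the order given by `order`.
TraversalResult traverse(const std::vector<std::uint32_t>& data,
                         const std::vector<std::size_t>& order) {
    assert(order.size() == data.size());  // precondition, checked before timing
    const auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::size_t i : order) {
        sum += data[i];  // same load-and-add in both runs
    }
    const auto stop = std::chrono::steady_clock::now();
    return {sum, std::chrono::duration<double, std::milli>(stop - start).count()};
}

// Build the index order 0..n-1, optionally shuffled with a fixed seed.
std::vector<std::size_t> make_order(std::size_t n, bool shuffled) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    if (shuffled) {
        std::mt19937 rng(42);  // fixed seed so runs are repeatable
        std::shuffle(order.begin(), order.end(), rng);
    }
    return order;
}
```

Calling `traverse(data, make_order(data.size(), false))` and then the same with `true` on an array much larger than the last-level cache (tens of millions of elements, say) should typically show the shuffled run taking noticeably longer, even though both return the same sum.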
Pipeline Stalls:
Modern CPUs are typically pipelined, so an instruction (or more than one) can be issued each cycle, but its result typically takes more than one cycle to become available. Your operation might depend on the result of another operation that isn’t ready when yours is scheduled, so the CPU has to wait until the operation generating the result has finished.
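Since the question mentions mul, here is a minimal sketch of that dependency effect using integer multiplication. Both functions are hypothetical examples, not taken from any library: the first forms one long chain in which every multiply waits for the previous product, while the second keeps four independent chains that the CPU can overlap. An optimizing compiler may already perform a similar transformation on its own, so the difference is not guaranteed to show up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One long dependency chain: every multiply needs the previous product.
std::uint64_t product_chained(const std::vector<std::uint64_t>& v) {
    std::uint64_t acc = 1;
    for (std::uint64_t x : v) {
        acc *= x;  // cannot start until the previous iteration's mul has finished
    }
    return acc;
}

// Four independent chains: multiplies from different chains can overlap.
std::uint64_t product_interleaved(const std::vector<std::uint64_t>& v) {
    // Assumption of this sketch: the element count is a multiple of four.
    assert(v.size() % 4 == 0);
    std::uint64_t a = 1, b = 1, c = 1, d = 1;
    for (std::size_t i = 0; i < v.size(); i += 4) {
        a *= v[i];
        b *= v[i + 1];
        c *= v[i + 2];
        d *= v[i + 3];
    }
    // Unsigned arithmetic wraps modulo 2^64, so regrouping gives the same result.
    return a * b * c * d;
}
```

Both functions execute the same number of mul instructions per element. If you time them on a large vector, any gap comes from how long each multiply has to wait for its input, not from the instruction itself.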
Instruction Cache Misses:
The instruction stream is also cached, so you might find a situation where the CPU generates a cache miss every time it encounters a particular location (unlikely for anything that takes a measurable share of the runtime though, since instruction caches aren’t that small).
Branch Misprediction:
Another kind of pipeline stall. The CPU tries to predict which way a conditional jump will go and speculatively executes the code on that path. If the prediction is wrong, it has to discard the results of the speculative execution and start over on the other path. In a profiler, this cost might show up on the first line of the other path.
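A common way to observe this is to run one conditional loop over the same values in unsorted and in sorted order. The sketch below does that. `sum_at_least`, `time_both_orders`, and `BranchTiming` are names invented for this example, and the effect only appears if the compiler keeps the `if` as a real branch (it may instead emit a conditional move or vectorize the loop, which hides the difference).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Sum only the elements at or above `threshold`.
// The compare and add are identical for any input order; only the
// predictability of the branch changes.
std::uint64_t sum_at_least(const std::vector<std::uint8_t>& data,
                           std::uint8_t threshold) {
    std::uint64_t sum = 0;
    for (std::uint8_t x : data) {
        if (x >= threshold) {  // easy to predict on sorted data, hard on random data
            sum += x;
        }
    }
    return sum;
}

// Timings for the same loop on the data as given and on a sorted copy.
struct BranchTiming {
    double unsorted_ms;
    double sorted_ms;
    std::uint64_t sum;
};

BranchTiming time_both_orders(std::vector<std::uint8_t> data,
                              std::uint8_t threshold) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    const auto t0 = Clock::now();
    const std::uint64_t unsorted_sum = sum_at_least(data, threshold);
    const auto t1 = Clock::now();

    std::sort(data.begin(), data.end());  // sorting itself is not timed

    const auto t2 = Clock::now();
    const std::uint64_t sorted_sum = sum_at_least(data, threshold);
    const auto t3 = Clock::now();

    assert(unsorted_sum == sorted_sum);  // same work, same result
    return {Ms(t1 - t0).count(), Ms(t3 - t2).count(), sorted_sum};
}
```

With a few million random bytes and a threshold of 128, the unsorted pass is often several times slower on typical hardware when the branch survives compilation, but treat that as something to measure rather than a guarantee.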
Resource Contention:
The operation might not depend on an unavailable result, but the execution unit it needs might still be occupied by another instruction (some instructions are not fully pipelined on all processors, or the unit might be shared, as with Hyperthreading or Bulldozer’s shared FPU). Again, the CPU might have to stall until the unit is free.
Page Faults:
Should be pretty obvious: basically a cache miss on steroids. If the accessed memory has to be reloaded from disk, it will cost hundreds of thousands of cycles or more.
…: The list goes on; however, the points mentioned above are the ones most likely to make an impact, in my opinion.