I am optimizing an algorithm in ARM assembly and need to figure out in which order to place the instructions to minimize pipeline stalls. The cycle counter at http://pulsar.webshaker.net/ccc/index.php?lng=us is very useful for this, but it does not model what happens on function calls or branches. What I want to do is basically (this is just an example):
mul r4, r0, r1
mov r0, #0
mov r1, #12
mov r4, r4, ASR #14
str r4, [r5]
bl foo
The pipeline stall between the mul and the dependent mov instruction is quite horrible, and there is nothing stopping me from doing the function call between them. But what exactly happens with the pipeline when I do the branch? I know that foo will do push {r4-r12, lr} as its first instruction. I can see two possible outcomes:
- The branch instruction takes a few cycles, which enables the mul instruction to deliver its result before push is performed, thereby reducing the pipeline stall.
- The pipeline stall is increased since push needs r4 a few cycles before it is executed (this was the case before ARMv7 IIRC; the cycle counter in the link does not seem to think this is needed).
In short:
What happens with instructions with delayed results (mul being the prime example) when you do a function call (which is assumed to push the register on the stack) or even a normal branch?
If I understand correctly, you don't need to execute the shift (mov r4, r4, ASR #14) before the call.
Doing the call before that mov is a good idea.
The mul will have more time to finish during the call.
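As a rough sketch, the reordering could look like the following. It assumes that foo preserves r4 and r5 (which the push {r4-r12, lr} in the question suggests) and that foo does not read the value stored at [r5], since the store now happens after the call. Neither assumption is confirmed by the question.

```asm
    mul r4, r0, r1        @ start the multiply; its result is not needed yet
    mov r0, #0            @ first argument for foo
    mov r1, #12           @ second argument for foo
    bl  foo               @ call while the mul result is still in flight
                          @ (assumes foo saves and restores r4 and r5)
    mov r4, r4, ASR #14   @ shift the product after the call returns
    str r4, [r5]          @ store the result (assumes foo did not need it)
```

Whether this actually shortens the stall still depends on how soon foo's push needs r4, which is the question discussed next.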
The STM will clearly be a problem, because it has to store R4. Of course, you can push R4 before it is computed.
If foo is an asm function, you can save R4 later in foo (or perhaps avoid using R4 in foo at all, so that it never needs to be saved).
If foo is a C function (or if you cannot change the push instruction), use R12 instead of R4 as the destination register of the MUL.
R12 will be needed later by the STM instruction than R4 would be, so the MUL may have enough time to finish before the STM needs its destination register (R12).