I am optimizing an algorithm in ARM assembly and need to figure out in


Asked: June 1, 2026 (2026-06-01T03:32:12+00:00)


I am optimizing an algorithm in ARM assembly and need to figure out in which order to place the instructions to minimize pipeline stalls. The cycle counter at http://pulsar.webshaker.net/ccc/index.php?lng=us is very useful for this, but it does not model what happens on function calls or branches. What I want to do is basically this (just an example):

mul       r4, r0, r1
mov       r0, #0
mov       r1, #12
mov       r4, r4, ASR #14
str       r4, [r5]
bl        foo

The pipeline stall between the mul and mov instructions is quite horrible, and there is nothing stopping me from doing the function call between them. But what exactly happens with the pipeline when I do the branch? I know that foo will do push {r4-r12, lr} as its first instruction. I can see two possible outcomes:

The branch instruction takes a few cycles which enables the mul instruction to deliver its result before push is performed, thereby reducing the pipeline stall.
The pipeline stall is increased since push needs r4 a few cycles before it is executed (this was the case before ARMv7 IIRC, the cycle counter in the link does not seem to think this is needed).

In short:
What happens to instructions with delayed results (mul being the prime example) when you make a function call (which is assumed to push the registers onto the stack) or even take a normal branch?


1 Answer

Editorial Team · Answer 1 · 2026-06-01T03:32:13+00:00

If I understand correctly, you don’t need to execute

mov       r4, r4, ASR #14
str       r4, [r5]

before the call.
Doing the call before the mov

bl        foo
mov       r4, r4, ASR #14
str       r4, [r5]

is a good idea.

The mul will have more time to finish during the call.
The STM (the push at the start of foo) will clearly be a problem, though: of course you can’t push R4 before it has been computed.

If foo is an asm function, you can save R4 later in foo (or perhaps avoid using r4 in foo altogether, so that it does not need to be saved).
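As a rough sketch of the first suggestion, assuming foo is hand-written and assembled with GNU assembler syntax (@ starts a comment), the save of r4 could be split off from the rest of the push. The instructions in the body are placeholders invented for illustration, not the real foo, and whether this removes the stall depends on the core:

```
foo:
        push    {r5-r12, lr}      @ none of these registers waits on the pending mul
        add     r6, r0, r1        @ placeholder work that does not touch r4
        push    {r4}              @ r4 is saved late, giving the mul time to complete
        mov     r4, r6            @ from here on r4 may be used freely
        mov     r0, r4            @ placeholder return value
        pop     {r4}              @ restore in the reverse order of the pushes
        pop     {r5-r12, pc}      @ restore the rest and return
```

The same set of registers is saved as with push {r4-r12, lr}, only in two steps. The stack is not 8-byte aligned between the two pushes, so no call should be made from that region.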

If foo is a C function (or if you cannot change the push instruction), use r12 instead of r4 as the destination register of the MUL.

The STM instruction needs R12 later than R4, because R12 comes near the end of its register list. The mul may then have enough time to finish before the STM needs its destination register (R12).
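Put together with the example from the question, the r12 variant might look like the sketch below. It assumes GNU assembler syntax and, as stated in the question, that foo starts with push {r4-r12, lr} and restores the same registers on return. Note that the ARM procedure call standard treats r12 as a scratch register that a callee or a linker-inserted veneer may overwrite, so this only works when foo is known to preserve it:

```
        mul     r12, r0, r1        @ product goes to r12 instead of r4
        mov     r0, #0             @ first argument for foo
        mov     r1, #12            @ second argument for foo
        bl      foo                @ foo's push stores r4 first and r12 near the end
        mov     r4, r12, ASR #14   @ scale the product after the call returns
        str     r4, [r5]           @ r5 is preserved because foo saves r4-r12
```

The mul still reads r0 and r1 before they are overwritten with the arguments, so the instruction order at the top must stay as it is.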
