The Archive Base

Editorial Team
Asked: June 1, 2026

I am optimizing an algorithm in ARM assembly and need to work out the instruction order that minimizes pipeline stalls. The cycle counter at http://pulsar.webshaker.net/ccc/index.php?lng=us is very useful for this, but it does not model what happens on function calls or branches. What I want to do is basically this (just an example):

mul       r4, r0, r1
mov       r0, #0
mov       r1, #12
mov       r4, r4, ASR #14
str       r4, [r5]
bl        foo

The pipeline stall between the mul and the dependent mov r4, r4, ASR #14 is quite bad, and nothing stops me from placing the function call between them. But what exactly happens to the pipeline when I take the branch? I know that foo will do push {r4-r12, lr} as its first instruction. I can see two possible outcomes:

  1. The branch instruction takes a few cycles, which gives the mul instruction time to deliver its result before the push is performed, thereby reducing the pipeline stall.
  2. The pipeline stall gets longer, since push needs r4 a few cycles before it is executed (IIRC this was the case before ARMv7; the cycle counter linked above does not seem to think this is needed).

In short:
What happens to instructions with delayed results (mul being the prime example) when you make a function call (which is assumed to push the registers onto the stack) or even take a normal branch?

1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 3:32 am

    If I understand correctly, you don't need to execute

    mov       r4, r4, ASR #14
    str       r4, [r5]
    

    before the call.
    Doing the call before the mov

    bl        foo
    mov       r4, r4, ASR #14
    str       r4, [r5]
    

    is a good idea.

    The mul will have more time to finish during the call.
    The STM (the push at the start of foo) is clearly the problem: r4 cannot be pushed before it has been computed.

    If foo is an assembly function, you can save r4 later inside foo (or perhaps avoid using r4 in foo altogether, so that it never needs to be saved).
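
    A minimal sketch of that idea, assuming foo is hand-written assembly and that its original push {r4-r12, lr} can be split in two. The body instructions are placeholders invented for illustration, not foo's real code, and whether the deferred push actually hides the mul latency depends on the core and would need to be measured.

    ```asm
    foo:
        push    {r5-r12, lr}        @ r4 is not in the list, so this store does not wait for the caller's mul
        add     r5, r0, r1          @ placeholder work that does not touch r4
        push    {r4}                @ save r4 only now, just before foo first overwrites it
        mov     r4, r5              @ placeholder: first write to r4 inside foo
        mov     r0, r4              @ placeholder return value
        pop     {r4}                @ restore the caller's r4
        pop     {r5-r12, pc}        @ restore the remaining registers and return
    ```

    The two pushes still total ten registers, so the stack usage matches the original single push {r4-r12, lr}.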

    If foo is a C function (or if you otherwise cannot change the push instruction), use r12 instead of r4 as the destination register of the mul.

    r12 comes later than r4 in the register list, so the STM instruction needs it later. The mul may then have enough time to finish before the STM needs its destination register (r12).
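
    One way the caller from the question might look with r12 as the mul destination. This is a sketch under the question's stated assumption that foo starts with push {r4-r12, lr} and restores the same registers on return, so r12 survives the call. Under the standard ARM procedure call convention r12 is a scratch register, so an arbitrary C function is not required to preserve it; it is worth checking the compiled prologue and epilogue before relying on this.

    ```asm
        mul     r12, r0, r1         @ result goes to r12 instead of r4
        mov     r0, #0              @ arguments for foo, as in the question
        mov     r1, #12
        bl      foo                 @ assumed to begin with push {r4-r12, lr} and to restore r12
        mov     r12, r12, ASR #14   @ consume the mul result only after the call
        str     r12, [r5]           @ r5 is callee-saved, so it is still valid here
    ```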
