Editorial Team
Asked: May 18, 2026

Consider the following NEON-optimized function:

#include <arm_neon.h>

void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
    // Make sure "a" is mapped to registers in the d0-d15 range,
    // as required by the scalar operand of the NEON multiply operations below:
    register float32x4_t a0 asm("q0") = a.val[0];
    register float32x4_t a1 asm("q1") = a.val[1];
    register float32x4_t a2 asm("q2") = a.val[2];
    register float32x4_t a3 asm("q3") = a.val[3];
    asm volatile (
    "\n\t# multiply two matrices...\n\t"
    "# result (%q0,%q1,%q2,%q3)  = first column of B (%q4) * first row of A (q0-q3)\n\t"
    "vmul.f32 %q0, %q4, %e8[0]\n\t"
    "vmul.f32 %q1, %q4, %e9[0]\n\t"
    "vmul.f32 %q2, %q4, %e10[0]\n\t"
    "vmul.f32 %q3, %q4, %e11[0]\n\t"
    "# result (%q0,%q1,%q2,%q3) += second column of B (%q5) * second row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q5, %e8[1]\n\t"
    "vmla.f32 %q1, %q5, %e9[1]\n\t"
    "vmla.f32 %q2, %q5, %e10[1]\n\t"
    "vmla.f32 %q3, %q5, %e11[1]\n\t"
    "# result (%q0,%q1,%q2,%q3) += third column of B (%q6) * third row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q6, %f8[0]\n\t"
    "vmla.f32 %q1, %q6, %f9[0]\n\t"
    "vmla.f32 %q2, %q6, %f10[0]\n\t"
    "vmla.f32 %q3, %q6, %f11[0]\n\t"
    "# result (%q0,%q1,%q2,%q3) += last column of B (%q7) * last row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q7, %f8[1]\n\t"
    "vmla.f32 %q1, %q7, %f9[1]\n\t"
    "vmla.f32 %q2, %q7, %f10[1]\n\t"
    "vmla.f32 %q3, %q7, %f11[1]\n\t\n\t"
    : "=&w"  (result.val[0]), "=&w"  (result.val[1]), "=&w"  (result.val[2]), "=&w" (result.val[3])
    : "w"   (b.val[0]),      "w"   (b.val[1]),      "w"   (b.val[2]),      "w"   (b.val[3]),
      "w"   (a0),            "w"   (a1),            "w"   (a2),            "w"   (a3)
    :
    );
}

Why does GCC 4.5 generate this abomination to load the first matrix:

vldmia  r1, {d0-d1}
vldr    d2, [r1, #16]
vldr    d3, [r1, #24]
vldr    d4, [r1, #32]
vldr    d5, [r1, #40]
vldr    d6, [r1, #48]
vldr    d7, [r1, #56]

…instead of just:

vldmia  r1, {q0-q3}

…?

The options I use:

arm-none-eabi-gcc-4.5.1 -x c++ -march=armv7-a -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize

Note that using the iPhoneOS-provided compiler produces the same thing:

/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc-4.2 -x c++ -arch armv7 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize
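
For readers who want to check the arithmetic without NEON hardware, the lane operations in the inline assembly above can be sketched in portable scalar C++. This is only a reference under stated assumptions: `Mat44` and `mat44_multiply_ref` are hypothetical names that do not appear in the question, and `Mat44::val[j][i]` is assumed to stand for lane `i` of `float32x4x4_t::val[j]`.

```cpp
#include <cassert>

// Hypothetical stand-in for float32x4x4_t:
// val[j][i] plays the role of lane i of the vector val[j].
struct Mat44 {
    float val[4][4];
};

// Scalar reference for the vmul/vmla sequence in the question:
//   result.val[j] = sum over k of b.val[k] * a.val[j][k]
// Like the asm version (early-clobber outputs), result must not alias a or b.
void mat44_multiply_ref(Mat44& result, const Mat44& a, const Mat44& b) {
    for (int j = 0; j < 4; ++j) {          // one output vector per vector of a
        for (int i = 0; i < 4; ++i) {      // lane within the output vector
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {  // scalar a.val[j][k] scales b.val[k]
                sum += b.val[k][i] * a.val[j][k];
            }
            result.val[j][i] = sum;
        }
    }
}
```

Comparing the output of `mat44_multiply_neon` against such a reference on a few matrices is one way to confirm that the `%e`/`%f` operand modifiers pick the intended lanes.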
1 Answer

Editorial Team
Added an answer on May 18, 2026 at 11:59 pm

    Simple answer:

    GCC is currently not very good at generating ARM code. If you look closely at other code, you'll find that GCC almost never arranges registers so that it can use multiple-register loads and stores, except in hard-coded places such as function prologues/epilogues and inline memcpy.

    With NEON instructions the code gets even worse. This has something to do with the way the NEON register file works: each quadword (Q) register can also be addressed as a pair of doubleword (D) registers. As far as I know, this kind of register aliasing is unique among the architectures GCC supports, so the code generator does not produce optimal code in all cases.
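
The register overlap described above can be modelled in a few lines of portable C++. This is a minimal sketch, not GCC's internal representation: all function names are hypothetical, and it assumes the ARMv7 NEON layout in which `qN` overlays `d(2N)` and `d(2N+1)`, with 8 bytes per D register.

```cpp
#include <cassert>
#include <string>

// On ARMv7 NEON, quadword register qN overlays d(2N) (low half)
// and d(2N+1) (high half).
int q_low_d(int q)  { return 2 * q; }
int q_high_d(int q) { return 2 * q + 1; }

// Byte offset from the base address at which dN is filled when a
// contiguous register list starting at d0 is loaded (8 bytes per D register).
int d_load_offset(int d) { return 8 * d; }

// Hypothetical helper: the D-register list covering {q<first>-q<last>}.
std::string q_range_as_d_list(int first_q, int last_q) {
    return "{d" + std::to_string(q_low_d(first_q)) +
           "-d" + std::to_string(q_high_d(last_q)) + "}";
}
```

Under this model, `q_range_as_d_list(0, 3)` yields `{d0-d7}`, and `d_load_offset` reproduces the `#16` through `#56` offsets of the individual `vldr` instructions in the question, so the seven generated loads cover the same 64 bytes as the single `vldmia r1, {q0-q3}` the asker expected.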

    By the way, while I'm at it: GCC does not seem to know that using the 'free' barrel shifter on the Cortex-A8 has an important impact on register scheduling, and it mostly gets this wrong.
