Consider the following NEON-optimized function: void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b)

Question

Asked: May 18, 2026

Consider the following NEON-optimized function:

void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
    // Make sure "a" is mapped to registers in the d0-d15 range,
    // as required by the NEON multiply operations below:
    register float32x4_t a0 asm("q0") = a.val[0];
    register float32x4_t a1 asm("q1") = a.val[1];
    register float32x4_t a2 asm("q2") = a.val[2];
    register float32x4_t a3 asm("q3") = a.val[3];
    asm volatile (
    "\n\t# multiply two matrices...\n\t"
    "# result (%q0,%q1,%q2,%q3)  = first column of B (%q4) * first row of A (q0-q3)\n\t"
    "vmul.f32 %q0, %q4, %e8[0]\n\t"
    "vmul.f32 %q1, %q4, %e9[0]\n\t"
    "vmul.f32 %q2, %q4, %e10[0]\n\t"
    "vmul.f32 %q3, %q4, %e11[0]\n\t"
    "# result (%q0,%q1,%q2,%q3) += second column of B (%q5) * second row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q5, %e8[1]\n\t"
    "vmla.f32 %q1, %q5, %e9[1]\n\t"
    "vmla.f32 %q2, %q5, %e10[1]\n\t"
    "vmla.f32 %q3, %q5, %e11[1]\n\t"
    "# result (%q0,%q1,%q2,%q3) += third column of B (%q6) * third row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q6, %f8[0]\n\t"
    "vmla.f32 %q1, %q6, %f9[0]\n\t"
    "vmla.f32 %q2, %q6, %f10[0]\n\t"
    "vmla.f32 %q3, %q6, %f11[0]\n\t"
    "# result (%q0,%q1,%q2,%q3) += last column of B (%q7) * last row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q7, %f8[1]\n\t"
    "vmla.f32 %q1, %q7, %f9[1]\n\t"
    "vmla.f32 %q2, %q7, %f10[1]\n\t"
    "vmla.f32 %q3, %q7, %f11[1]\n\t\n\t"
    : "=&w"  (result.val[0]), "=&w"  (result.val[1]), "=&w"  (result.val[2]), "=&w" (result.val[3])
    : "w"   (b.val[0]),      "w"   (b.val[1]),      "w"   (b.val[2]),      "w"   (b.val[3]),
      "w"   (a0),            "w"   (a1),            "w"   (a2),            "w"   (a3)
    :
    );
}
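
For reference, the arithmetic that the asm block performs can be sketched in portable C++. This is only an illustrative model, not the code in question: `Vec4`, `Mat44`, and `mat44_multiply_ref` are hypothetical stand-ins for `float32x4_t`, `float32x4x4_t`, and the NEON routine, and the sketch assumes each `val[i]` holds four consecutive floats.

```cpp
#include <array>
#include <cstddef>

// Hypothetical portable stand-ins for float32x4_t and float32x4x4_t.
using Vec4 = std::array<float, 4>;
struct Mat44 { Vec4 val[4]; };

// Scalar model of the vmul/vmla sequence:
//   result.val[i] = sum over k of b.val[k] * a.val[i][k]
void mat44_multiply_ref(Mat44& result, const Mat44& a, const Mat44& b) {
    Mat44 out{};  // temporary, so that result may alias a or b
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t k = 0; k < 4; ++k) {
            // Lane k of a.val[i]; the asm selects it with %e/%f and [0]/[1].
            const float scalar = a.val[i][k];
            for (std::size_t lane = 0; lane < 4; ++lane) {
                out.val[i][lane] += b.val[k][lane] * scalar;
            }
        }
    }
    result = out;
}
```

Each `vmul.f32`/`vmla.f32` line in the asm corresponds to one pass of the innermost loop: a whole vector of `b` scaled by a single lane of `a`.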

Why does GCC 4.5 generate this abomination for loading the first matrix:

vldmia  r1, {d0-d1}
vldr    d2, [r1, #16]
vldr    d3, [r1, #24]
vldr    d4, [r1, #32]
vldr    d5, [r1, #40]
vldr    d6, [r1, #48]
vldr    d7, [r1, #56]

…instead of just:

vldmia  r1, {q0-q3}

…?

Options I use:

arm-none-eabi-gcc-4.5.1 -x c++ -march=armv7-a -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize

Note that using the iPhoneOS-provided compiler produces the same thing:

/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc-4.2 -x c++ -arch armv7 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize

1 Answer

Editorial Team · Answer 1 · 2026-05-18T23:59:57+00:00

Simple answer:

GCC is currently not very good at generating ARM code. If you look closely at other code, you'll find that GCC almost never allocates registers in a way that lets it use multiple-register loads and stores, except in hard-coded places such as function prologues/epilogues and inline memcpy.

With NEON instructions the code gets even worse. This has to do with the way the NEON register file works: the same registers can be addressed either as 128-bit quad-word registers (q0-q15) or as pairs of 64-bit double-word registers (d0-d31). As far as I know, this kind of register aliasing is unique among the architectures GCC supports, so the code generator does not produce optimal code in all cases.
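
The register aliasing described above can be illustrated with a small portable model. This is a sketch for intuition only, not how GCC represents registers internally: `NeonRegFile`, `d()`, and `q()` are hypothetical names, and the layout assumes the ARMv7 arrangement in which register qN overlaps d(2N) and d(2N+1).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of the ARMv7 NEON register file as 64 float lanes.
struct NeonRegFile {
    std::array<float, 64> lanes{};

    // Lane `lane` (0..1) of the 64-bit register d<n>, n in 0..31.
    float& d(std::size_t n, std::size_t lane) {
        assert(n < 32 && lane < 2);
        return lanes[n * 2 + lane];
    }

    // Lane `lane` (0..3) of the 128-bit register q<n>, n in 0..15.
    float& q(std::size_t n, std::size_t lane) {
        assert(n < 16 && lane < 4);
        return lanes[n * 4 + lane];
    }
};
```

In this model, `q(1, 2)` and `d(3, 0)` refer to the same storage. That is why the seven separate loads of d0 through d7 in the question fill exactly the registers that a single `vldmia r1, {q0-q3}` would.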

By the way, while I'm at it: GCC has no idea that using the ‘free’ barrel shifter on the Cortex-A8 has an important impact on register scheduling, and it mostly gets this wrong.
