Consider the following NEON-optimized function:
void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
// Make sure "a" is mapped to registers in the d0-d15 range,
// as required by the NEON multiply-by-scalar operations below:
register float32x4_t a0 asm("q0") = a.val[0];
register float32x4_t a1 asm("q1") = a.val[1];
register float32x4_t a2 asm("q2") = a.val[2];
register float32x4_t a3 asm("q3") = a.val[3];
asm volatile (
"\n\t# multiply two matrices...\n\t"
"# result (%q0,%q1,%q2,%q3) = first column of B (%q4) * first row of A (q0-q3)\n\t"
"vmul.f32 %q0, %q4, %e8[0]\n\t"
"vmul.f32 %q1, %q4, %e9[0]\n\t"
"vmul.f32 %q2, %q4, %e10[0]\n\t"
"vmul.f32 %q3, %q4, %e11[0]\n\t"
"# result (%q0,%q1,%q2,%q3) += second column of B (%q5) * second row of A (q0-q3)\n\t"
"vmla.f32 %q0, %q5, %e8[1]\n\t"
"vmla.f32 %q1, %q5, %e9[1]\n\t"
"vmla.f32 %q2, %q5, %e10[1]\n\t"
"vmla.f32 %q3, %q5, %e11[1]\n\t"
"# result (%q0,%q1,%q2,%q3) += third column of B (%q6) * third row of A (q0-q3)\n\t"
"vmla.f32 %q0, %q6, %f8[0]\n\t"
"vmla.f32 %q1, %q6, %f9[0]\n\t"
"vmla.f32 %q2, %q6, %f10[0]\n\t"
"vmla.f32 %q3, %q6, %f11[0]\n\t"
"# result (%q0,%q1,%q2,%q3) += last column of B (%q7) * last row of A (q0-q3)\n\t"
"vmla.f32 %q0, %q7, %f8[1]\n\t"
"vmla.f32 %q1, %q7, %f9[1]\n\t"
"vmla.f32 %q2, %q7, %f10[1]\n\t"
"vmla.f32 %q3, %q7, %f11[1]\n\t\n\t"
: "=&w" (result.val[0]), "=&w" (result.val[1]), "=&w" (result.val[2]), "=&w" (result.val[3])
: "w" (b.val[0]), "w" (b.val[1]), "w" (b.val[2]), "w" (b.val[3]),
"w" (a0), "w" (a1), "w" (a2), "w" (a3)
:
);
}
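For reference, here is a plain scalar sketch of what the asm block above is meant to compute, which can help when checking the NEON version against known inputs. It assumes that `%eN[0]`/`%eN[1]` select lanes 0 and 1 (the low D half) and `%fN[0]`/`%fN[1]` select lanes 2 and 3 (the high D half) of operand N. `Mat44` and `mat44_multiply_ref` are hypothetical names introduced only for this illustration; `Mat44` stands in for `float32x4x4_t` so the sketch builds without `<arm_neon.h>`.

```cpp
#include <cstddef>

// Hypothetical stand-in for float32x4x4_t: val[i] plays the role of one
// float32x4_t (one Q register), val[i][j] is lane j of that register.
struct Mat44 {
    float val[4][4];
};

// Scalar reference for the asm block:
//   result.val[i] = sum over k of b.val[k] * a.val[i][k]
// Like the asm version (early-clobber outputs), result must not alias a or b.
void mat44_multiply_ref(Mat44& result, const Mat44& a, const Mat44& b) {
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k) {
                // k = 0 corresponds to the vmul, k = 1..3 to the three vmla steps
                sum += b.val[k][j] * a.val[i][k];
            }
            result.val[i][j] = sum;
        }
    }
}
```

Multiplying with an identity matrix in either position should return the other matrix unchanged, which makes for a quick sanity check of both versions.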
Why does GCC 4.5 generate this abomination, for loading the first matrix:
vldmia r1, {d0-d1}
vldr d2, [r1, #16]
vldr d3, [r1, #24]
vldr d4, [r1, #32]
vldr d5, [r1, #40]
vldr d6, [r1, #48]
vldr d7, [r1, #56]
…instead of just:
vldmia r1, {q0-q3}
…?
Options I use:
arm-none-eabi-gcc-4.5.1 -x c++ -march=armv7-a -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize
Note that using the iPhoneOS-provided compiler produces the same thing:
/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc-4.2 -x c++ -arch armv7 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize
Simple answer:
The GCC compiler is currently not very good at generating ARM code. If you look closely at other code, you’ll find that GCC almost never arranges registers so that it can use multi-register loads and stores, except in hard-coded places such as function prologues/epilogues and inlined memcpy.
When it comes to NEON instructions, the code becomes even worse. This has to do with the way the NEON unit works: each quad-word (Q) register overlays a pair of double-word (D) registers, so the same storage can be addressed either way. This is (as far as I know) unique among the architectures GCC supports, and as a result the code generator does not produce optimal code in all cases.
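To make the register overlay concrete, here is a small sketch of the index mapping between Q and D registers on ARMv7 NEON: qN occupies d(2N) and d(2N+1). `DPair` and `d_registers_of` are hypothetical helpers written only for this illustration; they are not part of GCC or of any NEON API.

```cpp
// Hypothetical helper type: the two D registers that make up one Q register.
struct DPair {
    int low;   // index of the D register holding the low 64 bits
    int high;  // index of the D register holding the high 64 bits
};

// On ARMv7 NEON, quad-word register qN overlays d(2N) and d(2N+1).
DPair d_registers_of(int q_index) {
    return DPair{2 * q_index, 2 * q_index + 1};
}
```

With this mapping, q0-q3 cover exactly d0-d7, which is why the eight 64-bit loads in the question touch the same registers as a single load of {q0-q3} would.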
By the way, while I’m at it: GCC has no idea that using the ‘free’ barrel shifter on the Cortex-A8 has a significant impact on register scheduling, and it gets this mostly wrong.