How do I use the multiply-accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone explain which three parameters I have to pass to this function? I mean, which are the source and destination registers, and what does the function return?
Help!!!
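Simply put, the vmla instruction computes a + b * c in each of the four lanes: the first parameter is the accumulator, the second and third parameters are the two factors, and the return value is the resulting vector.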
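A rough scalar sketch of what `vmlaq_f32` computes per lane, in plain C. The type `f32x4` and the function `vmla_ref` are made-up names for illustration only; they are not part of `arm_neon.h`, and the real `float32x4_t` is an opaque vector type rather than a struct.

```c
/* Hypothetical scalar stand-in for a 128-bit NEON register holding four floats. */
typedef struct { float val[4]; } f32x4;

/* Reference model of vmlaq_f32(a, b, c): result[i] = a[i] + b[i] * c[i].
 * a is the accumulator, b and c are the two factors. */
static f32x4 vmla_ref(f32x4 a, f32x4 b, f32x4 c)
{
    f32x4 result;
    for (int i = 0; i < 4; i++) {
        result.val[i] = a.val[i] + b.val[i] * c.val[i];
    }
    return result;
}
```

With the real intrinsic you would write `acc = vmlaq_f32(acc, b, c);`, so the accumulator is both the first argument and the variable that receives the return value.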
And the intrinsic compiles into a single assembler instruction 🙂
You can use this NEON intrinsic, among other things, in the typical 4×4 matrix multiplications used in 3D graphics.
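Here is a minimal sketch of such a transform: a 4×4 matrix times a 4-component vector, built as one multiply followed by three multiply-accumulates. The function name `mat4_mul_vec4` and the column-major layout of `m` are my assumptions for the example. The intrinsics `vld1q_f32`, `vmulq_n_f32`, `vmlaq_n_f32` and `vst1q_f32` come from `arm_neon.h`; the scalar branch is only there so the code also builds on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out = M * v, where m[16] holds M in column-major order. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Start the chain with a plain multiply: column 0 scaled by v[0]. */
    float32x4_t acc = vmulq_n_f32(vld1q_f32(m + 0), v[0]);
    /* Each following step feeds the previous result back in as the accumulator. */
    acc = vmlaq_n_f32(acc, vld1q_f32(m + 4),  v[1]);
    acc = vmlaq_n_f32(acc, vld1q_f32(m + 8),  v[2]);
    acc = vmlaq_n_f32(acc, vld1q_f32(m + 12), v[3]);
    vst1q_f32(out, acc);
#else
    /* Scalar fallback with the same multiply, then accumulate structure. */
    for (int row = 0; row < 4; row++) {
        float acc = m[row] * v[0];
        for (int col = 1; col < 4; col++) {
            acc += m[4 * col + row] * v[col];
        }
        out[row] = acc;
    }
#endif
}
```

Each `vmlaq_n_f32` takes the previous result as its accumulator, so no separate add is needed between the multiplies.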
This saves a couple of cycles because you don't have to add the results after the multiplication. The addition is used so often that multiply-accumulates have become mainstream these days (even x86 has added them in its recent FMA instruction set extensions).
Also worth mentioning: multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast path inside the Cortex-A8 NEON core. This fast path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs roughly four times faster than a VMUL / VADD / VMUL / VADD series.
Take a look at my simple matrix multiply: I did exactly that. Thanks to this fast path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.