how to use the Multiply-Accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t , float32x4_t

Question

0

Asked: May 15, 20262026-05-15T18:56:30+00:00 2026-05-15T18:56:30+00:00

how to use the Multiply-Accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t , float32x4_t

0

how to use the Multiply-Accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone explain what three parameters I have to pass to this function. I mean the Source and destination registers and what the function returns?

Help!!!

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-15T18:56:30+00:00

Simply said the vmla instruction does the following:

struct 
{
  float val[4];
} float32x4_t


float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
  float32x4 result;

  for (int i=0; i<4; i++)
  {
    result.val[i] =  b.val[i]*c.val[i]+a.val[i];
  }

  return result;
}

And all this compiles into a singe assembler instruction 🙂

You can use this NEON-assembler intrinsic among other things in typical 4×4 matrix multiplications for 3D-graphics like this:

float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
{
  /* in a perfect world this code would compile into just four instructions */
  float32x4_t result;

  result = vml (matrix[0], vector);
  result = vmla (result, matrix[1], vector);
  result = vmla (result, matrix[2], vector);
  result = vmla (result, matrix[3], vector);

  return result;
}

This saves a couple of cycles because you don’t have to add the results after multiplication. The addition is so often used that multiply-accumulates hsa become mainstream these days (even x86 has added them in some recent SSE instruction set).

Also worth mentioning: Multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast-path inside the Cortex-A8 NEON-Core. This fast-path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VML or VMLA instruction. I could go into detail but in a nutshell such an instruction series runs four times faster than a VML / VADD / VML / VADD series.

Take a look at my simple matrix-multiply: I did exactly that. Due to this fast-path it will run roughly four times faster than implementation written using VML and ADD instead of VMLA.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

how to use the Multiply-Accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t , float32x4_t

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply