I need to speed up a loop, because it is called by my application thousands of times. I suppose I need to do it with NEON, but I don't know where to begin.
Assumptions / pre-conditions:
w is always 320 (a multiple of 16/32).
pa and pb are 16-byte aligned.
ma and mb are positive.
int whileInstruction (const unsigned char *pa, const unsigned char *pb, int ma, int mb, int w)
{
    int sum = 0;
    do {
        sum += ((*pa++) - ma) * ((*pb++) - mb);
    } while (--w);
    return sum;
}
This attempt at vectorizing it is not working well, and isn’t safe (missing clobbers), but demonstrates what I’m trying to do:
int whileInstruction (const unsigned char *pa,const unsigned char *pb,int ma,int mb,int w)
{
asm volatile("lsr %2, %2, #3 \n"
".loop: \n"
"# load 8 elements: \n"
"vld4.8 {d0-d3}, [%1]! \n"
"vld4.8 {d4-d7}, [%2]! \n"
"# do the operation: \n"
"vaddl.u8 q7, d0, r7 \n"
"vaddl.u8 q8, d1, d8 \n"
"vmlal.u8 q7, q7, q8 \n"
"# Sum the vector and save in sum (this is wrong):\n"
"vaddl.u8 q7, d0, r7 \n"
"subs %2, %2, #1 \n" // Decrement iteration count
"bne .loop \n" // Repeat until iteration count is zero
:
: "r"(pa), "r"(pb), "r"(w),"r"(ma),"r"(mb),"r"(sum)
: "r4", "r5", "r6","r7","r8","r9"
);
return sum;
}
Here is a simple NEON implementation. I have tested this against the scalar code to make sure that it works. Note that for best performance both pa and pb should be 16-byte aligned.
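For illustration, here is a minimal sketch of how this loop could be written with NEON intrinsics rather than inline assembly. It is not necessarily the same code as the implementation mentioned above. The function names dot_centered_scalar and dot_centered_neon are hypothetical. The sketch assumes that w is a positive multiple of 8 and that ma and mb lie in 0..255, so each difference fits in 16 bits and the sum fits in 32 bits. The question only states that ma and mb are positive, so that range is an extra assumption. On non-ARM targets the function falls back to the scalar loop so the file still compiles.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference: same arithmetic as the loop in the question. */
static int dot_centered_scalar(const unsigned char *pa, const unsigned char *pb,
                               int ma, int mb, int w)
{
    int sum = 0;
    for (int i = 0; i < w; i++)
        sum += (pa[i] - ma) * (pb[i] - mb);
    return sum;
}

/* Hypothetical NEON version.
 * Assumptions: w > 0, w % 8 == 0, and 0 <= ma, mb <= 255. */
static int dot_centered_neon(const unsigned char *pa, const unsigned char *pb,
                             int ma, int mb, int w)
{
    assert(w > 0 && w % 8 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x8_t vma = vdupq_n_s16((int16_t)ma);
    const int16x8_t vmb = vdupq_n_s16((int16_t)mb);
    int32x4_t acc = vdupq_n_s32(0);

    for (int i = 0; i < w; i += 8) {
        /* Load 8 bytes from each input and widen them to 16 bits. */
        int16x8_t a = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pa + i)));
        int16x8_t b = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pb + i)));

        /* Subtract the offsets; results lie in [-255, 255]. */
        a = vsubq_s16(a, vma);
        b = vsubq_s16(b, vmb);

        /* Widening multiply-accumulate into four 32-bit lanes. */
        acc = vmlal_s16(acc, vget_low_s16(a), vget_low_s16(b));
        acc = vmlal_s16(acc, vget_high_s16(a), vget_high_s16(b));
    }

    /* Horizontal add of the four 32-bit partial sums. */
    int32x2_t s = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    s = vpadd_s32(s, s);
    return vget_lane_s32(s, 0);
#else
    /* No NEON available: use the scalar loop. */
    return dot_centered_scalar(pa, pb, ma, mb, w);
#endif
}
```

Comparing dot_centered_neon against dot_centered_scalar on the same buffers, for example two 320-byte arrays filled with varied values, is a quick way to check that both return the same sum. Using intrinsics also leaves register allocation to the compiler, which avoids the missing-clobber problem in the inline assembly attempt.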