I have some vectors of unsigned chars that represent the pixels of a frame.
I got this function working without the MMX optimization, but I'm frustrated that the MMX version doesn't work. So:
I need to add two unsigned chars (the sum needs to be done in 16 bits instead of 8, because an unsigned char only ranges from 0 to 255) and divide the result by two (shift right by 1). The code I have so far is below, but the values are wrong; _mm_adds_pu16 seems to add only 8 bits instead of 16:
MM0 = _mm_setzero_si64(); //all zeros
MM1 = TO_M64(lv1+k); //first 8 unsigned chars
MM2 = TO_M64(lv2+k); //second 8 unsigned chars
MM3 =_mm_unpacklo_pi8(MM0,MM1); //get first 4chars from MM1 and add Zeros
MM4 =_mm_unpackhi_pi8(MM0,MM1); //get last 4chars from MM1 and add Zeros
MM5 =_mm_unpacklo_pi8(MM0,MM2); //same as above for line 2
MM6 =_mm_unpackhi_pi8(MM0,MM2);
MM1 = _mm_adds_pu16(MM3,MM5); //add both chars as a 16bit sum (255+255 max range)
MM2 = _mm_adds_pu16(MM4,MM6);
MM3 = _mm_srai_pi16(MM1,1); //right shift (division by 2)
MM4 = _mm_srai_pi16(MM2,1);
MM1 = _mm_packs_pi16(MM3,MM4); //pack the 2 MMX registers into one
v2 = TO_UCHAR(MM1); //put results in the destination array
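For reference, the scalar computation that the MMX code above is meant to reproduce can be sketched roughly as follows. The function name average_rows_scalar and the length parameter n are assumptions made for illustration; lv1, lv2 and v2 stand for the arrays used in the snippet above.

```c
#include <stddef.h>

/* Hypothetical scalar reference; the name and the n parameter are
   assumptions, not taken from the original code. */
void average_rows_scalar(const unsigned char *lv1, const unsigned char *lv2,
                         unsigned char *v2, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        /* Widen to 16 bits so that 255 + 255 = 510 does not wrap at 8 bits. */
        unsigned short sum = (unsigned short)(lv1[k] + lv2[k]);
        /* Shift right by 1: divide by two, truncating. */
        v2[k] = (unsigned char)(sum >> 1);
    }
}
```

Any vectorized version should produce the same bytes as this loop, which makes it handy for checking results.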
New developments:
Thanks for that king_nak!!
I wrote a simple version of what I am trying to do:
int main()
{
char A[8]={255,155,2,3,4,5,6,7};
char B[8]={255,155,2,3,4,5,6,7};
char C[8];
char D[8];
char R[8];
__m64* pA=(__m64*) A;
__m64* pB=(__m64*) B;
__m64* pC=(__m64*) C;
__m64* pD=(__m64*) D;
__m64* pR=(__m64*) R;
_mm_empty();
__m64 MM0 = _mm_setzero_si64();
__m64 MM1 = _mm_unpacklo_pi8(*pA,MM0);
__m64 MM2 = _mm_unpackhi_pi8(*pA,MM0);
__m64 MM3 = _mm_unpacklo_pi8(*pB,MM0);
__m64 MM4 = _mm_unpackhi_pi8(*pB,MM0);
__m64 MM5 = _mm_add_pi16(MM1,MM3);
__m64 MM6 = _mm_add_pi16(MM2,MM4);
printf("SUM:\n");
*pC= _mm_add_pi16(MM1,MM3);
*pD= _mm_add_pi16(MM2,MM4);
for(int i=0; i<8; i++) printf("\t%d ", (C[i])); printf("\n");
for(int i=0; i<8; i++) printf("\t%d ", D[i]); printf("\n");
printf("DIV:\n");
*pC= _mm_srai_pi16(MM5,1);
*pD= _mm_srai_pi16(MM6,1);
for(int i=0; i<8; i++) printf("\t%d ", (C[i])); printf("\n");
for(int i=0; i<8; i++) printf("\t%d ", D[i]); printf("\n");
MM1= _mm_srai_pi16(MM5,1);
MM2= _mm_srai_pi16(MM6,1);
printf("Final Result:\n");
*pR= _mm_packs_pi16(MM1,MM2);
for(int i=0; i<8; i++) printf("\t%d ", (R[i])); printf("\n");
return(0);
}
And the results are:
SUM:
-2 1 54 1 4 0 6 0
8 0 10 0 12 0 14 0
DIV:
-1 0 -101 0 2 0 3 0
4 0 5 0 6 0 7 0
Final Result:
127 127 2 3 4 5 6 7
Well, the small numbers are OK, while the big numbers, which come out as 127, are wrong. This is a problem; what am I doing wrong? :s
I think I found the problem:
The arguments of the unpack instructions are in the wrong order. If you look at the registers as a whole, it looks like the individual chars are zero-extended to shorts, but in fact, they are zero-padded. Just swap around mm0 and the other register in each case and it should work.
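To see why the order matters, here is a small scalar model of what the low-half byte unpack does to each 16-bit lane. It is only a sketch: unpacklo_bytes_model is a made-up name, and it assumes the usual little-endian lane layout, where the first operand supplies the low byte of each word.

```c
#include <stdint.h>

/* Hypothetical scalar model of PUNPCKLBW / _mm_unpacklo_pi8(a, b):
   the low four bytes of a and b are interleaved, with a's byte in the
   low half and b's byte in the high half of each 16-bit word. */
void unpacklo_bytes_model(const uint8_t a[8], const uint8_t b[8],
                          uint16_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)(a[i] | (b[i] << 8));
}
```

With the pixel data as the first operand and zero as the second, the byte 255 becomes the word 0x00FF, which is the zero-extension you want. With the operands the other way round it becomes 0xFF00, i.e. the value multiplied by 256.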
Also, you don’t need saturated add, a normal PADDW is sufficient. The maximum value you will get is 0xff+0xff=0x01fe, which doesn’t have to be saturated.
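If you want to convince yourself, a quick brute-force check over all byte pairs is enough. This is just a sketch with made-up helper names (add_wrap_u16, add_sat_u16, adds_agree_for_bytes) that model one 16-bit lane of a wrapping add and of an unsigned saturating add:

```c
#include <stdint.h>

/* Hypothetical model of one PADDW lane: the sum wraps modulo 2^16. */
uint16_t add_wrap_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Hypothetical model of one PADDUSW lane: the sum clamps at 0xFFFF. */
uint16_t add_sat_u16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return s > 0xFFFF ? (uint16_t)0xFFFF : (uint16_t)s;
}

/* Returns 1 if both adds agree for every pair of zero-extended bytes
   and no sum exceeds 0x01FE, otherwise 0. */
int adds_agree_for_bytes(void)
{
    for (unsigned a = 0; a <= 255; a++)
        for (unsigned b = 0; b <= 255; b++) {
            uint16_t w = add_wrap_u16((uint16_t)a, (uint16_t)b);
            if (w != add_sat_u16((uint16_t)a, (uint16_t)b) || w > 0x01FE)
                return 0;
        }
    return 1;
}
```

Since the two never differ on this input range, the saturating variant buys nothing here.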
Edit: What’s more, PACKSSWB doesn’t quite do what you want. PACKUSWB is the correct instruction; the signed saturation of PACKSSWB clamps every value above 127, which gives you wrong results.
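The difference between the two pack instructions can be modelled per lane like this. Again, this is only a sketch, and pack_signed_sat and pack_unsigned_sat are made-up names, not real intrinsics:

```c
#include <stdint.h>

/* Hypothetical model of one PACKSSWB lane:
   signed 16-bit word to signed 8-bit byte, clamped to [-128, 127]. */
int8_t pack_signed_sat(int16_t w)
{
    if (w > 127)  return 127;
    if (w < -128) return -128;
    return (int8_t)w;
}

/* Hypothetical model of one PACKUSWB lane:
   signed 16-bit word to unsigned 8-bit byte, clamped to [0, 255]. */
uint8_t pack_unsigned_sat(int16_t w)
{
    if (w > 255) return 255;
    if (w < 0)   return 0;
    return (uint8_t)w;
}
```

Feeding the averaged values 255 and 155 through the signed variant gives 127 for both, which matches the 127 127 at the start of your final result, while the unsigned variant keeps them as 255 and 155.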
Here’s a solution (Also replaced the shifts with logical ones and used different pseudo-registers in some places):