The game I’m converting operates on an 8-bit palettized texture, and nearly every frame I have to convert parts of that texture into an OpenGL texture for rendering. The conversion looks like this:
unsigned short RGB565PaletteLookupTable[256]; // Lookup table
unsigned char* Src; // Source data
unsigned short* Dst; // Destination buffer
int SrcPitch; // Source data row length
int OriginX, OriginY, Width, Height; // Subrectangle to copy
assert( Width % 4 == 0 );
int SrcOffset = SrcPitch-Width;
Src += OriginY*SrcPitch+OriginX;
int x, y;
for( y = OriginY; y < OriginY+Height; ++y, Src += SrcOffset )
{
    for( x = OriginX; x < OriginX+Width; x += 4 )
    {
        *Dst++ = RGB565PaletteLookupTable[*Src++];
        *Dst++ = RGB565PaletteLookupTable[*Src++];
        *Dst++ = RGB565PaletteLookupTable[*Src++];
        *Dst++ = RGB565PaletteLookupTable[*Src++];
    }
}
This code takes 17% of main thread time during the game, so I’m looking for ways to speed it up. The data goes directly to glTexSubImage2D(), so I can’t change anything about the destination buffer. The source data comes from game code that is ancient and undocumented; no one knows how it works anymore, so I can’t change much there either. The lookup table is provided by the same code and can change during the game.
Would it be possible to speed up this code using the Accelerate framework, assembly instructions, or any other means? I have read examples of direct conversion from RGB888 to RGB565, but those didn’t need lookup tables. Where should I look to learn how to speed this up?
UPDATE: I found that OriginX is also 4-aligned, and was able to refine the code in this way:
unsigned long RGB565PaletteLookupTable[256]; // Lookup table (RGB565 value in the low 16 bits of each entry)
unsigned char* Src; // Source data
unsigned long* Dst; // Destination buffer
int SrcPitch; // Source data row length
int OriginX, OriginY, Width, Height; // Subrectangle to copy
// Note: this version relies on a 32-bit unsigned long and little-endian byte order.
assert( Width % 4 == 0 );
assert( OriginX % 4 == 0 && SrcPitch % 4 == 0 ); // Needed for the 32-bit reads and for SrcOffset >>= 2
int SrcOffset = SrcPitch-Width;
Src += OriginY*SrcPitch+OriginX;
SrcOffset >>= 2; // Row skip, now counted in 32-bit words
int x, y;
unsigned long* LSrc = (unsigned long*)Src;
for( y = OriginY; y < OriginY+Height; ++y, LSrc += SrcOffset )
{
    for( x = OriginX; x < OriginX+Width; x += 4 )
    {
        unsigned long Indexes = *LSrc++; // Four palette indices in one read
        unsigned long Result = RGB565PaletteLookupTable[ Indexes & 0xFF ];
        Indexes >>= 8;
        Result |= ( RGB565PaletteLookupTable[ Indexes & 0xFF ] << 16 );
        *Dst++ = Result; // Pixels 0 and 1
        Indexes >>= 8;
        Result = RGB565PaletteLookupTable[ Indexes & 0xFF ];
        Indexes >>= 8;
        Result |= ( RGB565PaletteLookupTable[ Indexes & 0xFF ] << 16 );
        *Dst++ = Result; // Pixels 2 and 3
    }
}
As far as I can tell, this code doesn’t use any unaligned memory accesses. It improved performance a bit: it now takes 15.5% of main thread time. I was hoping for more of a speedup, though.
In theory, each of these lookup table operations is independent of the previous and subsequent ones (apart from the fact that they all read from the same lookup table), so I was expecting there would be some SIMD instruction, or perhaps some assembly instructions, that would allow looking up many pixels in parallel. Something like
_mm_movemask_ps( _mm_cmpneq_ps( _mm_loadu_ps( cmp1 ), _mm_loadu_ps( cmp2 ) ) )
which on Macs does the same thing as memcmp( cmp1, cmp2, 16 ), only 8 times faster.
I’ll continue looking for it now.
UPDATE: I determined that there seems to be no way to speed up the table lookup using the NEON instruction set. The table needs to be 512 bytes, so there’s no way to fit it entirely in ARM registers: the NEON VTBX instruction can index a table of at most 32 bytes, and it also assumes that the size of the lookup result equals the size of the index. There’s something that might serve as a solution to a similar problem, described at http://forums.arm.com/index.php?/topic/15521-8bit-look-up-table-by-neon-code/ , but it doesn’t fit mine. So making sure the alignment of all operands is correct seems to be the best possible answer to this problem.
The problem is with the cache. You do a lot of reads from Src, and if it is not 4-byte aligned (which might be the case, since OriginX is most likely arbitrary), (*Src++) wastes cycles on unaligned reads.
Try to enforce (OriginX % 4 == 0) and copy the remaining (OriginX % 4) pixels outside the main loop.
The same goes for “*Dst++ = ”: if Dst is unaligned, that is bad. Try to combine each pair of RGB565 values (two sequential *Dst writes) into one 32-bit write. You may even overwrite a few extra pixels to make the loop simpler and then handle the border pixels separately.
Hope you get the idea.
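To make the alignment idea concrete, here is a minimal sketch of converting one row: single pixels are written until the destination is 4-byte aligned, then two RGB565 values are packed into each 32-bit store, and a leftover pixel is handled at the end. ConvertRowAligned is a hypothetical helper name, not something from the original code, and the packing order assumes a little-endian target. Whether this actually beats the simple loop has to be measured on the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: converts `count` palette indices to RGB565 pixels.
   Assumes a little-endian target, so the first pixel goes in the low half. */
static void ConvertRowAligned(const uint16_t lut[256], const uint8_t *src,
                              uint16_t *dst, size_t count)
{
    assert(((uintptr_t)dst & 1u) == 0); /* dst must be at least 2-byte aligned */

    /* Head: write single pixels until dst is 4-byte aligned. */
    while (count > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = lut[*src++];
        --count;
    }

    /* Body: pack two pixels into one aligned 32-bit store.
       memcpy avoids aliasing problems; compilers normally emit a plain store. */
    while (count >= 2) {
        uint32_t packed = (uint32_t)lut[src[0]] | ((uint32_t)lut[src[1]] << 16);
        memcpy(dst, &packed, sizeof packed);
        src += 2;
        dst += 2;
        count -= 2;
    }

    /* Tail: at most one pixel is left over. */
    if (count > 0) {
        *dst = lut[*src];
    }
}
```

Calling it once per row with Dst advanced by Width each time keeps the rest of the original loop structure unchanged.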
The second way: offload the conversion to GPU.
Create a 1D texture for RGB565PaletteLookupTable and write a simple fragment shader that takes Src and RGB565PaletteLookupTable as inputs and outputs Dst (glTexImage2D will then update the Src texture, not Dst as you do now).
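A rough sketch of what such a shader could look like, embedded as a C string. This is untested on real hardware and rests on several assumptions that are not in the question: OpenGL ES 2.0 (which has no 1D textures, so the palette is a 256x1 2D texture), the indices uploaded as GL_LUMINANCE / GL_UNSIGNED_BYTE, the palette uploaded as GL_RGB / GL_UNSIGNED_SHORT_5_6_5, and GL_NEAREST filtering on both textures. The names uIndexTex, uPaletteTex, vTexCoord and PaletteTexCoord are made up for illustration, and the precision qualifiers may need adjusting per device.

```c
/* GLSL ES fragment shader source (hypothetical uniform/varying names). */
static const char *kPaletteFragmentShader =
    "precision mediump float;\n"
    "uniform sampler2D uIndexTex;   // 8-bit palette indices\n"
    "uniform sampler2D uPaletteTex; // 256x1 palette\n"
    "varying vec2 vTexCoord;\n"
    "void main() {\n"
    "    float index = texture2D(uIndexTex, vTexCoord).r; // 0..1 for 0..255\n"
    "    float u = (index * 255.0 + 0.5) / 256.0;         // texel centre\n"
    "    gl_FragColor = texture2D(uPaletteTex, vec2(u, 0.5));\n"
    "}\n";

/* CPU-side mirror of the coordinate math in the shader above, useful for
   checking that every index lands inside its own palette texel. */
static float PaletteTexCoord(unsigned char paletteIndex)
{
    float index = (float)paletteIndex / 255.0f;  /* what the sampler returns */
    return (index * 255.0f + 0.5f) / 256.0f;     /* centre of that texel */
}
```

With this approach the CPU only uploads the raw 8-bit indices, plus the 256-entry palette whenever the game changes it, and the per-pixel lookup happens at draw time.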