I’m working on an embedded device that does not support unaligned memory accesses.
For a video decoder I have to process pixels (one byte per pixel) in 8×8 pixel blocks. The device has some SIMD processing capabilities that allow me to work on 4 bytes in parallel.
The problem is, that the 8×8 pixel blocks aren’t guaranteed to start on an aligned address and the functions need to read/write up to three of these 8×8 blocks.
How would you approach this if you want very good performance? After a bit of thinking I came up with the following three ideas:
- Do all memory accesses as bytes. This is the easiest approach, but it is slow and does not work well with the SIMD capabilities (it’s what my reference C code currently does).
- Write four copy functions (one for each alignment case) that load the pixel data via two 32-bit reads, shift the bits into the correct position, and write the result to an aligned chunk of scratch memory. The video-processing functions can then use 32-bit accesses and SIMD. Drawback: the CPU has no chance to hide the memory latency behind the processing.
- Same idea as above, but instead of writing the pixels to scratch memory, do the video processing in place. This may be the fastest way, but the number of functions I would have to write for this approach is high (around 60, I guess).
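For reference, the second idea (two aligned 32-bit reads plus a shift) can be sketched in plain C like this. This is a hypothetical helper, not your actual code: it handles one row for a misalignment of 1–3 bytes (the shift of 0 case would just be straight word copies) and assumes a little-endian byte order.

```c
#include <stdint.h>
#include <string.h>

/* Copy one 8-pixel row from an unaligned source into a 4-byte-aligned
 * destination, using only aligned 32-bit loads.  'shift' is the source
 * misalignment in bytes (1..3).  Little-endian assumed; the shift==0
 * case would simply be two aligned word copies. */
static void copy_row_misaligned(uint8_t *dst, const uint8_t *src, unsigned shift)
{
    /* Round src down to its containing 4-byte boundary. */
    const uint32_t *p = (const uint32_t *)((uintptr_t)src & ~(uintptr_t)3);
    uint32_t lo = p[0];
    unsigned bits = shift * 8;

    for (int i = 0; i < 2; i++) {              /* 2 x 4 bytes = 8 pixels  */
        uint32_t hi = p[i + 1];
        uint32_t word = (lo >> bits) | (hi << (32 - bits));
        memcpy(dst + 4 * i, &word, 4);         /* dst is 4-byte aligned   */
        lo = hi;
    }
}
```

In assembler this maps to a load, a double-shift/funnel-shift (or two shifts and an OR), and an aligned store per word, which is where the four per-alignment variants come from.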
Btw: I will have to write all functions in assembler because the compiler generates horrible code when it comes to the SIMD extension.
Which road would you take, or do you have another idea how to approach this?
You should first split your code into fetch and processing stages.
The fetch code should copy into a working buffer, with separate cases for aligned memory (where you should be able to copy using the SIMD registers) and unaligned memory, where you need to copy byte by byte (if your platform can’t do unaligned accesses and your source/destination have different alignments, that is the best you can do).
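A minimal sketch of that fetch stage, assuming 8-pixel rows separated by a stride and a 4-byte-aligned scratch buffer (`fetch_block` and its layout are illustrative, not a real API):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy one 8x8 block into a 4-byte-aligned 64-byte scratch buffer.
 * Aligned rows go via 32-bit transfers (SIMD loads on real hardware);
 * unaligned rows fall back to byte copies. */
void fetch_block(uint8_t *dst, const uint8_t *src, size_t stride)
{
    for (int row = 0; row < 8; row++, src += stride, dst += 8) {
        if (((uintptr_t)src & 3) == 0) {
            uint32_t *d = (uint32_t *)dst;          /* dst is aligned */
            const uint32_t *s = (const uint32_t *)src;
            d[0] = s[0];
            d[1] = s[1];
        } else {
            for (int i = 0; i < 8; i++)             /* byte fallback  */
                dst[i] = src[i];
        }
    }
}
```

On the real target the byte-fallback branch would instead be one of the shift-based copy routines from the question, but the dispatch structure stays the same.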
Your processing code can then be pure SIMD, with the guarantee of working on aligned data. For any real amount of processing, copy-then-process will definitely be faster than non-SIMD operations on unaligned data.
Assuming your source and destination are the same, a further optimization would be to only use the working buffer when the source is unaligned, and to process in place when the memory is aligned. The benefit of this will depend on the characteristics of your data.
Depending on your architecture, you may get further benefits by prefetching: issuing instructions that pull areas of memory into the cache before they’re needed. Here that means issuing a prefetch for the next block while processing the current one.
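A sketch of that pattern, using the GCC/Clang `__builtin_prefetch` extension as a stand-in for whatever prefetch instruction your target provides (the block layout and the XOR kernel are made up for illustration):

```c
#include <stdint.h>
#include <stddef.h>

#ifndef __GNUC__
#define __builtin_prefetch(addr) ((void)(addr))   /* no-op elsewhere */
#endif

/* Process 'nblocks' 8x8 blocks lying side by side in a strided image,
 * prefetching the next block's rows before working on the current one. */
void process_blocks(uint8_t *img, size_t stride, int nblocks)
{
    for (int b = 0; b < nblocks; b++) {
        uint8_t *cur = img + (size_t)b * 8;
        if (b + 1 < nblocks) {
            uint8_t *next = cur + 8;
            for (int r = 0; r < 8; r++)
                __builtin_prefetch(next + r * stride);  /* warm the cache */
        }
        for (int r = 0; r < 8; r++)                     /* placeholder    */
            for (int i = 0; i < 8; i++)                 /* kernel         */
                cur[r * stride + i] ^= 0xFF;
    }
}
```

The prefetches are hints only: correctness never depends on them, so the worst case is wasted issue slots rather than wrong results.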