How do you efficiently transpose a matrix? Are there libraries for this, or what algorithm would you use?
E.g.:
enum { W = 3, H = 2 };
short src[W*H] = {
    1, 2, 3,
    4, 5, 6
};
short dest[W*H];
rotate_90_clockwise(dest, src, W, H); // <-- magic in here, no need for in-place
// dest is now (H wide, W tall):
// {
//     4, 1,
//     5, 2,
//     6, 3
// };
(In my specific case, the src array is raw image data and the destination is a framebuffer. I’m on an embedded ARM target, with a toolchain that doesn’t support assembly.)
There are libraries for this, in some cases. And, notably, there are tricks you can play with vectorized data (e.g., four 32-bit elements in a 128-bit vector, but this also applies to four 8-bit bytes in a 32-bit register) to go faster than individual-element accesses.
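Before reaching for any vector tricks, it may help to have a plain scalar baseline to measure against. The sketch below is only an illustration, not a tuned implementation: it assumes row-major storage, non-overlapping buffers, and the `rotate_90_clockwise(dest, src, W, H)` signature from the question (the `size_t` parameter types are my assumption). It relies on the fact that a 90-degree clockwise rotation is a transpose followed by reversing each row, which is why transpose techniques carry over to rotation.

```c
#include <stddef.h>

/* Rotate a row-major image that is w wide and h tall by 90 degrees
   clockwise. The result in dest is h wide and w tall.
   dest and src must not overlap. */
void rotate_90_clockwise(short *dest, const short *src, size_t w, size_t h)
{
    for (size_t y = 0; y < h; ++y) {
        for (size_t x = 0; x < w; ++x) {
            /* A transpose alone would send source (x, y) to row x, column y.
               Reversing each destination row then moves it to
               column h - 1 - y. */
            dest[x * h + (h - 1 - y)] = src[y * w + x];
        }
    }
}
```

This version reads src sequentially and writes dest with a stride of h elements, so on large images the writes are the cache-unfriendly side. Processing the image in small square tiles is the usual way to soften that.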
For a transpose, the standard idea is to use “shuffle” instructions, which build a new data vector out of the elements of two existing vectors, in any order. You work with 4×4 blocks of the input array, one row per vector. So, starting out, you have:

    a b c d
    e f g h
    i j k l
    m n o p

Then, you apply shuffle instructions to the first two vectors (interleaving their odd elements, A0B0 C0D0 -> ACBD, and interleaving their even elements, 0A0B 0C0D -> ACBD), and likewise to the last two, to create a new set of vectors with each 2×2 block transposed:

    a e c g
    b f d h
    i m k o
    j n l p

Finally, you apply shuffle instructions to the odd pair (the first and third vectors) and the even pair (the second and fourth), combining their first pairs of elements, AB00 CD00 -> ABCD, and their last pairs, 00AB 00CD -> ABCD, to get:

    a e i m
    b f j n
    c g k o
    d h l p
And there, 16 elements transposed in eight instructions!
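To make the two shuffle stages concrete, here is a portable C sketch that models each vector as a struct of four 32-bit lanes. The `vec4` type and all function names are hypothetical, invented for this illustration. Each helper stands in for one shuffle instruction, so real code would replace them with whatever shuffle intrinsics your target provides.

```c
#include <stdint.h>

/* Stand-in for a 128-bit vector register holding four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec4;

/* Stage 1 shuffle: 1st and 3rd elements of p and q, interleaved. */
static vec4 interleave_odd(vec4 p, vec4 q)
{
    vec4 r = {{ p.lane[0], q.lane[0], p.lane[2], q.lane[2] }};
    return r;
}

/* Stage 1 shuffle: 2nd and 4th elements of p and q, interleaved. */
static vec4 interleave_even(vec4 p, vec4 q)
{
    vec4 r = {{ p.lane[1], q.lane[1], p.lane[3], q.lane[3] }};
    return r;
}

/* Stage 2 shuffle: first pair of p followed by first pair of q. */
static vec4 combine_first_pairs(vec4 p, vec4 q)
{
    vec4 r = {{ p.lane[0], p.lane[1], q.lane[0], q.lane[1] }};
    return r;
}

/* Stage 2 shuffle: last pair of p followed by last pair of q. */
static vec4 combine_last_pairs(vec4 p, vec4 q)
{
    vec4 r = {{ p.lane[2], p.lane[3], q.lane[2], q.lane[3] }};
    return r;
}

/* Transpose a 4x4 block in place using eight shuffles. */
void transpose_4x4(vec4 rows[4])
{
    /* Stage 1: four shuffles transpose each 2x2 block. */
    vec4 t0 = interleave_odd(rows[0], rows[1]);
    vec4 t1 = interleave_even(rows[0], rows[1]);
    vec4 t2 = interleave_odd(rows[2], rows[3]);
    vec4 t3 = interleave_even(rows[2], rows[3]);

    /* Stage 2: four shuffles swap the off-diagonal 2x2 blocks. */
    rows[0] = combine_first_pairs(t0, t2);
    rows[1] = combine_first_pairs(t1, t3);
    rows[2] = combine_last_pairs(t0, t2);
    rows[3] = combine_last_pairs(t1, t3);
}
```

A full-image transpose would load each 4×4 tile at (x, y), run this, and store the result at tile (y, x).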
Now, for 8-bit bytes in 32-bit registers, ARM doesn’t have true shuffle instructions, but you can synthesize the first set of shuffles with shifts and a SEL (select) instruction, and you can do each shuffle in the second set in a single instruction with PKHBT (pack halfword bottom top) or PKHTB (pack halfword top bottom).
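Since the toolchain in the question can't use assembly, one option is to write the same byte-level shuffles in plain C with shifts and masks. The sketch below is an assumption-laden illustration rather than a tuned routine: the function name is made up, lane 0 is taken to be the least significant byte of each word (which matches memory order on a little-endian target), and whether a given compiler turns these expressions into SEL, PKHBT, or PKHTB is not guaranteed.

```c
#include <stdint.h>

/* Transpose a 4x4 block of bytes held in four 32-bit words, in place.
   Row i lives in r[i]; column j is byte j, counting from the least
   significant byte. */
void transpose_4x4_bytes(uint32_t r[4])
{
    const uint32_t lanes02 = 0x00FF00FFu;  /* bytes 0 and 2 */
    const uint32_t lanes13 = 0xFF00FF00u;  /* bytes 1 and 3 */

    /* Stage 1: interleave bytes of adjacent rows (2x2 blocks transposed). */
    uint32_t t0 = (r[0] & lanes02) | ((r[1] & lanes02) << 8);
    uint32_t t1 = ((r[0] >> 8) & lanes02) | (r[1] & lanes13);
    uint32_t t2 = (r[2] & lanes02) | ((r[3] & lanes02) << 8);
    uint32_t t3 = ((r[2] >> 8) & lanes02) | (r[3] & lanes13);

    /* Stage 2: pack halfwords. "Bottom of one, bottom of the other shifted
       up" is the pattern the text attributes to PKHBT; "top of one, top of
       the other shifted down" is the PKHTB pattern. */
    r[0] = (t0 & 0x0000FFFFu) | (t2 << 16);
    r[1] = (t1 & 0x0000FFFFu) | (t3 << 16);
    r[2] = (t0 >> 16) | (t2 & 0xFFFF0000u);
    r[3] = (t1 >> 16) | (t3 & 0xFFFF0000u);
}
```

For example, the rows 0x03020100, 0x07060504, 0x0B0A0908, 0x0F0E0D0C (bytes 0 through 15 in order) come out as 0x0C080400, 0x0D090501, 0x0E0A0602, 0x0F0B0703.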
Finally, if you’re using a larger ARM processor with the NEON vector extension, you can do something like this with 16-element vectors on 16×16 blocks.