I have written a program to process some data written to disk in big-endian format, so the program needs to swap bytes in order to do anything else. After profiling the code I found that my byte swapping function was taking 30% of the execution time. So I thought to myself, how can I speed this up?
So I decided to write a little piece of inline assembly.
I wound up replacing this:
void swapTwoByte(char* array, int numChunks)
{
    for(int i= (2*numChunks-1); i>=0; i-=2)
    {
        char temp=array[i];
        array[i]=array[i-1];
        array[i-1]=temp;
    }
}
with this:
void swapTwoByte(int16* array, int numChunks)
{
    for(int i= (numChunks-1); i>=0; --i)
    {
        asm("movw %1, %%ax;"
            "rorw %%ax;"
            "rorw %%ax;"
            "rorw %%ax;"
            "rorw %%ax;"
            "rorw %%ax;"
            "rorw %%ax;"
            "rorw %%ax;"
            "rorw %%ax;"
            "movw %%ax, %0;"
            : "=r" ( array[i] )
            : "r" (array[i])
            :"%ax"
        );
    }
}
Which does the intended job, but that is a lot of rotate operations.
So here is my question:
According to this source, rorw can take two operands, and in the GAS syntax the source operand should be the number of bits to rotate by, but every time I try to replace that list of 8 rotate rights with something like
".set rotate, 0x0008"
"rorw rotate, %%ax"
I get an assembler error stating:
"Error: number of operands mismatch for `ror'"
Why is this? What am I missing?
First of all, use
This will compile into optimal code on whatever system you are using, and it will even work if you happen to port your code to a big-endian platform.
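One standard routine that fits this description is POSIX `ntohs` (with `htons` for the opposite direction); that this is the call meant here is an assumption. A minimal sketch under that assumption, reusing the `swapTwoByte` name from the question, with `readBigEndian16` as a purely hypothetical helper:

```cpp
#include <arpa/inet.h>  // ntohs (POSIX header; other platforms provide it elsewhere)
#include <cstddef>
#include <cstdint>
#include <cstring>

// Convert numChunks big-endian 16-bit values in place to host byte order.
// On a big-endian host ntohs is a no-op, so the same code stays correct there.
void swapTwoByte(std::uint16_t* array, std::size_t numChunks)
{
    for (std::size_t i = 0; i < numChunks; ++i)
        array[i] = ntohs(array[i]);
}

// Hypothetical helper: read one big-endian 16-bit value from raw bytes.
// memcpy avoids alignment and aliasing problems with the raw buffer.
std::uint16_t readBigEndian16(const unsigned char* bytes)
{
    std::uint16_t raw;
    std::memcpy(&raw, bytes, sizeof raw);
    return ntohs(raw);
}
```

Compilers typically reduce `ntohs` to a single rotate or byte-swap instruction on little-endian x86, so no hand-written assembly should be needed.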
However, this will not fix your performance problem because I believe you have misidentified the problem. Nemo’s first rule of micro-optimization: “Math is fast; memory is slow”.
Iterating through a large block of memory and swapping its bytes is extremely cache-unfriendly. A byte swap is one cycle; a memory read or write is hundreds of cycles unless it hits in the cache.
So do not swap the bytes until you use them. My personal favorite approach is this:
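A minimal sketch of such a type, assuming only the name `be_uint16` used in this answer. The member name and the use of explicit shifts (rather than a library call) are illustrative choices, not necessarily the original listing:

```cpp
#include <cstdint>

// Two-byte value stored in big-endian order; it converts on every access.
class be_uint16
{
public:
    be_uint16() = default;

    // Implicit conversion from a host-order value: store the high byte first.
    be_uint16(std::uint16_t v)
        : bytes_{static_cast<unsigned char>(v >> 8),
                 static_cast<unsigned char>(v & 0xFF)} {}

    // Implicit conversion to a host-order value.
    operator std::uint16_t() const
    {
        return static_cast<std::uint16_t>((bytes_[0] << 8) | bytes_[1]);
    }

private:
    unsigned char bytes_[2];  // bytes_[0] is the most significant byte
};

// The class must be exactly two bytes so it can overlay the raw data.
static_assert(sizeof(be_uint16) == 2, "be_uint16 must overlay two raw bytes");
```

Storing the value as two `unsigned char` members keeps the alignment requirement at one byte, so the type can overlay data at any offset in a buffer.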
This defines a two-byte class that represents a big-endian number in memory. It implicitly casts to and from uint16_t as needed. So cast your memory pointer to a be_uint16 * and just access it like an array; forget about the byte swapping because the class will do it for you:
Note that you can even do things like this:
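A hedged sketch of both kinds of use, reading elements like an array and doing a read-modify-write in place. The function names `sumSamples` and `incrementAll` are hypothetical, and a compact `be_uint16` is repeated here only so the snippet compiles on its own:

```cpp
#include <cstddef>
#include <cstdint>

// Compact stand-in for the big-endian wrapper described above (illustrative).
class be_uint16
{
public:
    be_uint16() = default;
    be_uint16(std::uint16_t v)
        : bytes_{static_cast<unsigned char>(v >> 8),
                 static_cast<unsigned char>(v & 0xFF)} {}
    operator std::uint16_t() const
    {
        return static_cast<std::uint16_t>((bytes_[0] << 8) | bytes_[1]);
    }
private:
    unsigned char bytes_[2];  // most significant byte first
};

// Sum big-endian samples straight from the raw buffer. Each element is
// converted to host order only at the moment it is read.
std::uint32_t sumSamples(const char* data, std::size_t numChunks)
{
    const be_uint16* values = reinterpret_cast<const be_uint16*>(data);
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < numChunks; ++i)
        total += values[i];
    return total;
}

// Read-modify-write: the read converts to host order, the assignment
// converts back, so the buffer stays big-endian throughout.
void incrementAll(char* data, std::size_t numChunks)
{
    be_uint16* values = reinterpret_cast<be_uint16*>(data);
    for (std::size_t i = 0; i < numChunks; ++i)
        values[i] = static_cast<std::uint16_t>(values[i] + 1);
}
```

No separate pass over the buffer is made; the conversion happens while the data is already in cache for the real work.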
The overhead of swapping a value immediately before use is, in my experience, undetectable. Locality is the name of the game…