The code for x86 does this (n can only be 1 through 4, unknown at compile time):
static const uint32_t wordmask[] = {
    0u, 0xffu, 0xffffu, 0xffffffu, 0xffffffffu
};
static inline uint32_t get_unaligned_le_x86(const void *p, uint32_t n) {
    uint32_t ret = *(const uint32_t *)p & wordmask[n];
    return ret;
}
For architectures that don’t have unaligned 32-bit little-endian loads, I have two variants:
static uint32_t get_unaligned_le_v1(const void *p, uint32_t n) {
    const uint8_t *b = (const uint8_t *)p;
    uint32_t ret;
    ret = b[0];
    if (n > 1) {
        /* cast first: b[i] promotes to int, and b[3] << 24 can overflow it */
        ret |= (uint32_t)b[1] << 8;
        if (n > 2) {
            ret |= (uint32_t)b[2] << 16;
            if (n > 3) {
                ret |= (uint32_t)b[3] << 24;
            }
        }
    }
    return ret;
}
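static uint32_t get_unaligned_le_v2(const void *p, uint32_t n) {
    const uint8_t *b = (const uint8_t *)p;
    /* always reads all four bytes, then masks off the ones beyond n */
    uint32_t ret = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                   ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    ret &= wordmask[n];
    return ret;
}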
Which would be better on real hardware (I’m using qemu for development), and can you suggest a faster alternative? If it’s much faster, I’m game for using assembly.
Conditional execution on the ARM is your best bet for improved performance. Table lookups (masks) will definitely be slower on ARM. Here is my ARMv5 implementation:
Update: fixed ldreqb to be ldrgeb
Update 2: shaved off another cycle by inserting an instruction between the last ldr and orr
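For anyone who would rather stay in C, here is a rough sketch of the same shape: one unconditional byte load followed by three independent tests against n. This is not the assembly implementation mentioned above. The name get_unaligned_le_flat is made up for this example, and whether a given compiler turns the tests into conditional loads rather than branches is an assumption you should verify in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): reads n (1..4) bytes
 * at p as a little-endian value. The three flat tests against n are meant
 * to give the compiler a chance to emit a compare followed by predicated
 * load/or instructions on ARM; that outcome is not guaranteed. */
static uint32_t get_unaligned_le_flat(const void *p, uint32_t n) {
    const uint8_t *b = (const uint8_t *)p;
    uint32_t ret = b[0];

    assert(n >= 1 && n <= 4);
    /* Cast before shifting so the shift is done in uint32_t, not int. */
    if (n >= 2) ret |= (uint32_t)b[1] << 8;
    if (n >= 3) ret |= (uint32_t)b[2] << 16;
    if (n >= 4) ret |= (uint32_t)b[3] << 24;
    return ret;
}
```

Unlike the masked variants, this never reads past byte n - 1, so it is safe to call on the last few bytes of a buffer.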