The code for x86 does this (n can only be 1 through 4, unknown

Question


Asked: May 30, 2026 (2026-05-30T21:48:44+00:00)


The code for x86 does this (n can only be 1 through 4, unknown at compile time):

static const uint32_t wordmask[] = {
  0u, 0xffu, 0xffffu, 0xffffffu, 0xffffffffu
};
static inline uint32_t get_unaligned_le_x86(const void *p, uint32_t n) {
  uint32_t ret = *(const uint32_t *)p & wordmask[n];
  return ret;
}

For architectures that don't have unaligned 32-bit little-endian loads, I have two variants:

static uint32_t get_unaligned_le_v1(const void *p, uint32_t n) {
  const uint8_t *b = (const uint8_t *)p;
  uint32_t ret;
  ret = b[0];
  if (n > 1) {
    ret |= b[1] << 8;
    if (n > 2) {
      ret |= b[2] << 16;
      if (n > 3) {
        ret |= (uint32_t)b[3] << 24;  /* cast: b[3] promotes to int, so << 24 could overflow */
      }
    }
  }
  return ret;
}

static uint32_t get_unaligned_le_v2(const void *p, uint32_t n) {
  const uint8_t *b = (const uint8_t *)p;
  uint32_t ret = b[0] | (b[1] << 8) | (b[2] << 16) | ((uint32_t)b[3] << 24);
  ret &= wordmask[n];
  return ret;
}

Which would be better on real hardware (I'm using qemu for development), and can you suggest a faster alternative? If it's much faster, I'm fine with using assembly.


1 Answer

Editorial Team · Answer 1 · 2026-05-30T21:48:45+00:00

Conditional execution on ARM is your best bet for improved performance. Table lookups (masks) will definitely be slower on ARM, since the mask itself has to be loaded from memory before it can be applied. Here is my ARMv5 implementation:

; When called from C, r0 = first parameter, r1 = second parameter
; r0-r3 and r12 can get trashed by C functions
unaligned_read:
  ldrb r2,[r0],#1      ; byte 0 is always read (n=1..4)
  cmp r1,#2
  ldrgeb r3,[r0],#1   ; byte 1, n >= 2
  ldrgtb r12,[r0],#1  ; byte 2, n > 2
  orrge r2,r2,r3,LSL #8
  orrgt r2,r2,r12,LSL #16
  cmp r1,#4
  ldreqb r3,[r0],#1   ; byte 3, n == 4
  movne r0,r2         ; recoup wasted cycle
  orreq r0,r2,r3,LSL #24
  mov pc,lr           ; or "bx lr" for thumb compatibility
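
For readers less familiar with ARM conditional execution, the routine above can be read roughly as the following C sketch. The name unaligned_read_c is hypothetical and not part of the original answer; each line is annotated with the instructions it corresponds to. Like the assembly, it touches only the first n bytes, and it assumes n is 1 through 4.

```c
#include <stdint.h>

/* Hypothetical C rendering of unaligned_read; assumes 1 <= n <= 4. */
static uint32_t unaligned_read_c(const uint8_t *p, uint32_t n) {
  uint32_t ret = p[0];                      /* ldrb: byte 0 is always read */
  if (n >= 2) ret |= (uint32_t)p[1] << 8;   /* ldrgeb + orrge */
  if (n > 2)  ret |= (uint32_t)p[2] << 16;  /* ldrgtb + orrgt */
  if (n == 4) ret |= (uint32_t)p[3] << 24;  /* ldreqb + orreq */
  return ret;
}
```

A compiler targeting ARM may or may not turn these three tests into conditional instructions, so this is meant as a reading aid rather than a drop-in replacement for the assembly.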

Update: fixed ldreqb to be ldrgeb

Update 2: shaved off another cycle by inserting an instruction between last ldr/orr
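
If a C-only fallback is preferred, one way to avoid the table lookup is to compute the mask with a shift instead of indexing wordmask. The sketch below is a hypothetical variant of get_unaligned_le_v2 (the name get_unaligned_le_shiftmask is made up here). It assumes n is 1 through 4 and, like v2, always reads four bytes, so the buffer must be at least four bytes long. Whether it beats the conditional-execution version depends on the core and would need to be measured on real hardware.

```c
#include <stdint.h>

/* Hypothetical variant of get_unaligned_le_v2 without the wordmask table. */
static uint32_t get_unaligned_le_shiftmask(const void *p, uint32_t n) {
  const uint8_t *b = (const uint8_t *)p;
  /* Assemble all four bytes in little-endian order. Like v2, this always
     reads b[0..3], regardless of n. */
  uint32_t ret = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                 ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
  /* n is assumed to be 1..4, so the shift count is 24, 16, 8 or 0. */
  return ret & (0xffffffffu >> (32u - 8u * n));
}
```

The shift count never reaches 32 for valid n, so the mask expression stays well defined; passing n = 0 would not be safe with this sketch.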
