Assuming something like: void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned

Question

0

Asked: May 27, 20262026-05-27T12:33:52+00:00 2026-05-27T12:33:52+00:00

Assuming something like: void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned

0

Assuming something like:

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
  unsigned int i;
  for(i=0; i<len; i++)
  {
     dest[i] = src[i] & mask[i];
  }
}

I can go faster on a non-aligned access machine (e.g. x86) by writing something like:

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
  unsigned int i;
  unsigned int wordlen = len >> 2;
  for(i=0; i<wordlen; i++)
  {
    ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i]; // this raises SIGBUS on SPARC and other archs that require aligned access.
  }
  for(i=wordlen<<2; i<len; i++){
    dest[i] = src[i] & mask[i];
  }
}

However it needs to build on several architectures so I would like to do something like:

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
  unsigned int i;
  unsigned int wordlen = len >> 2;

#if defined(__ALIGNED2__) || defined(__ALIGNED4__) || defined(__ALIGNED8__)
  // go slow
  for(i=0; i<len; i++)
  {
     dest[i] = src[i] & mask[i];
  }
#else
  // go fast
  for(i=0; i<wordlen; i++)
  {
    // the following line will raise SIGBUS on SPARC and other archs that require aligned access.
    ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i]; 
  }
  for(i=wordlen<<2; i<len; i++){
    dest[i] = src[i] & mask[i];
  }
#endif
}

But I cannot find any good information on compiler defined macros (like my hypothetical __ALIGNED4__ above) that specify alignment or any clever ways of using the pre-processor to determine target architecture alignment. I could just test defined (__SVR4) && defined (__sun), but I would prefer something that will Just Work^{^_TM} on other architectures requiring aligned memory accesses.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-27T12:33:53+00:00

While x86 silently fixes up unaligned accesses, this is hardly optimal for performance. It is usually best to assume a certain alignment and perform fixups yourself:

unsigned int const alignment = 8;   /* or 16, or sizeof(long) */

void memcpy(char *dst, char const *src, unsigned int size) {
    if((((intptr_t)dst) % alignment) != (((intptr_t)src) % alignment)) {
        /* no common alignment, copy as bytes or shift around */
    } else {
        if(((intptr_t)dst) % alignment) {
            /* copy bytes at the beginning */
        }
        /* copy words in the middle */
        if(((intptr_t)dst + size) % alignment) {
            /* copy bytes at the end */
        }
    }
}

Also, take a look at SIMD instructions.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Assuming something like: void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply