I am performing a scattered read of 8-bit data from a file (De-Interleaving a

Question

0

Asked: May 13, 20262026-05-13T08:15:46+00:00 2026-05-13T08:15:46+00:00

I am performing a scattered read of 8-bit data from a file (De-Interleaving a

0

I am performing a scattered read of 8-bit data from a file (De-Interleaving a 64 channel wave file). I am then combining them to be a single stream of bytes. The problem I’m having is with my re-construction of the data to write out.

Basically I’m reading in 16 bytes and then building them into a single __m128i variable and then using _mm_stream_ps to write the value back out to memory. However I have some odd performance results.

In my first scheme I use the _mm_set_epi8 intrinsic to set my __m128i as follows:

    const __m128i packedSamples = _mm_set_epi8( sample15,   sample14,   sample13,   sample12,   sample11,   sample10,   sample9,    sample8,
                                                sample7,    sample6,    sample5,    sample4,    sample3,    sample2,    sample1,    sample0 );

Basically I leave it all up to the compiler to decide how to optimise it to give best performance. This gives WORST performance. MY test runs in ~0.195 seconds.

Second I tried to merge down by using 4 _mm_set_epi32 instructions and then packing them down:

    const __m128i samples0      = _mm_set_epi32( sample3, sample2, sample1, sample0 );
    const __m128i samples1      = _mm_set_epi32( sample7, sample6, sample5, sample4 );
    const __m128i samples2      = _mm_set_epi32( sample11, sample10, sample9, sample8 );
    const __m128i samples3      = _mm_set_epi32( sample15, sample14, sample13, sample12 );

    const __m128i packedSamples0    = _mm_packs_epi32( samples0, samples1 );
    const __m128i packedSamples1    = _mm_packs_epi32( samples2, samples3 );
    const __m128i packedSamples     = _mm_packus_epi16( packedSamples0, packedSamples1 );

This does improve performance somewhat. My test now runs in ~0.15 seconds. Seems counter-intuitive that performance would improve by doing this as I assume this is exactly what _mm_set_epi8 is doing anyway …

My final attempt was to use a bit of code I have from making four CCs the old fashioned way (with shifts and ors) and then putting them in an __m128i using a single _mm_set_epi32.

    const GCui32 samples0       = MakeFourCC( sample0, sample1, sample2, sample3 );
    const GCui32 samples1       = MakeFourCC( sample4, sample5, sample6, sample7 );
    const GCui32 samples2       = MakeFourCC( sample8, sample9, sample10, sample11 );
    const GCui32 samples3       = MakeFourCC( sample12, sample13, sample14, sample15 );
    const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );

This gives even BETTER performance. Taking ~0.135 seconds to run my test. I’m really starting to get confused.

So I tried a simple read byte write byte system and that is ever-so-slightly faster than even the last method.

So what is going on? This all seems counter-intuitive to me.

I’ve considered the idea that the delays are occuring on the _mm_stream_ps because I’m supplying data too quickly but then I would to get exactly the same results out whatever I do. Is it possible that the first 2 methods mean that the 16 loads can’t get distributed through the loop to hide latency? If so why is this? Surely an intrinsic allows the compiler to make optimisations as and where it pleases .. i thought that was the whole point … Also surely performing 16 reads and 16 writes will be much slower than 16 reads and 1 write with a bunch of SSE juggling instructions … After all its the reads and writes that are the slow bit!

Anyone with any ideas whats going on will be much appreciated! 😀

Edit: Further to the comment below I stopped pre-loading the bytes as constants and changedit to this:

    const __m128i samples0      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples1      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples2      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples3      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;

    const __m128i packedSamples0    = _mm_packs_epi32( samples0, samples1 );
    const __m128i packedSamples1    = _mm_packs_epi32( samples2, samples3 );
    const __m128i packedSamples     = _mm_packus_epi16( packedSamples0, packedSamples1 );

and this improved performance to ~0.143 seconds. Sitll not as good as the straight C implementation …

Edit Again: The best performance I’m getting thus far is

    // Load the samples.
    const GCui8 sample0     = *(pSamples + channelStep0);
    const GCui8 sample1     = *(pSamples + channelStep1);
    const GCui8 sample2     = *(pSamples + channelStep2);
    const GCui8 sample3     = *(pSamples + channelStep3);

    const GCui32 samples0   = Build32( sample0, sample1, sample2, sample3 );
    pSamples += channelStep4;

    const GCui8 sample4     = *(pSamples + channelStep0);
    const GCui8 sample5     = *(pSamples + channelStep1);
    const GCui8 sample6     = *(pSamples + channelStep2);
    const GCui8 sample7     = *(pSamples + channelStep3);

    const GCui32 samples1   = Build32( sample4, sample5, sample6, sample7 );
    pSamples += channelStep4;

    // Load the samples.
    const GCui8 sample8     = *(pSamples + channelStep0);
    const GCui8 sample9     = *(pSamples + channelStep1);
    const GCui8 sample10    = *(pSamples + channelStep2);
    const GCui8 sample11    = *(pSamples + channelStep3);

    const GCui32 samples2       = Build32( sample8, sample9, sample10, sample11 );
    pSamples += channelStep4;

    const GCui8 sample12    = *(pSamples + channelStep0);
    const GCui8 sample13    = *(pSamples + channelStep1);
    const GCui8 sample14    = *(pSamples + channelStep2);
    const GCui8 sample15    = *(pSamples + channelStep3);

    const GCui32 samples3   = Build32( sample12, sample13, sample14, sample15 );
    pSamples += channelStep4;

    const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );

    _mm_stream_ps( pWrite + 0,  *(__m128*)&packedSamples );

This gives me processing in ~0.095 seconds which is considerably better. I don’t appear to be able to get close with SSE though … I’m still confused by that but .. ho hum.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-13T08:15:47+00:00

Perhaps the compiler is trying to put all the arguments to the intrinsic into registers at once. You don’t want to access that many variables at once without organizing them.

Rather than declare a separate identifier for each sample, try putting them into a char[16]. The compiler will promote the 16 values to registers as it sees fit, as long as you don’t take the address of anything within the array. You can add an __aligned__ tag (or whatever VC++ uses) and maybe avoid the intrinsic altogether. Otherwise, calling the intrinsic with ( sample[15], sample[14], sample[13] … sample[0] ) should make the compiler’s job easier or at least do no harm.

Edit: I’m pretty sure you’re fighting a register spill but that suggestion will probably just store the bytes individually, which isn’t what you want. I think my advice is to interleave your final attempt (using MakeFourCC) with the read operations, to make sure it’s scheduled correctly and with no round-trips to the stack. Of course, inspection of object code is the best way to ensure that.

Essentially, you are streaming data into the register file and then streaming it back out. You don’t want to overload it before it’s time to flush the data.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am performing a scattered read of 8-bit data from a file (De-Interleaving a

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply