Many SSE instructions allow the source operand to be a 16-byte aligned memory address, for example the various (un)pack instructions. PUNPCKLBW has the following signature:
PUNPCKLBW xmm1, xmm2/m128
Now this doesn’t seem to be possible at all with intrinsics. It looks like it’s mandatory to use _mm_load* intrinsics to read anything in memory. This is the intrinsic for PUNPCKLBW:
__m128i _mm_unpacklo_epi8 (__m128i a, __m128i b);
(As far as I know, the __m128i type always refers to an XMM register.)
Now, why is this? It’s rather sad since I see some optimization potential by addressing memory directly…
The intrinsics correspond relatively directly to actual instructions, but compilers are not obligated to issue the corresponding instructions. Optimizing a load followed by an operation (even when written in intrinsics) into the memory form of the operation is a common optimization performed by all respectable compilers when it is advantageous to do so.
TLDR: write the load and the operation in intrinsics, and let the compiler optimize it.
Edit: trivial example:
Compiling with
gcc -Os -fomit-frame-pointer
gives:
See? The optimizer will sort it out.
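For reference, here is a minimal sketch of the load-plus-operation pattern described above. It is not the original example from this answer; the function names `unpack_lo_from_mem` and `interleave_low_bytes` are made up for illustration, and it assumes an x86 target with SSE2 and a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_si128, _mm_unpacklo_epi8, ... */
#include <stdint.h>

/* Hypothetical helper: explicit aligned load followed by the unpack.
 * The compiler is free to fold the load into the memory-operand form
 * of PUNPCKLBW instead of emitting a separate MOVDQA. */
__m128i unpack_lo_from_mem(__m128i a, const __m128i *p)
{
    __m128i b = _mm_load_si128(p);   /* p must be 16-byte aligned */
    return _mm_unpacklo_epi8(a, b);  /* result bytes: a0, b0, a1, b1, ..., a7, b7 */
}

/* Hypothetical byte-array wrapper so the result is easy to inspect.
 * `a` and `out` may be unaligned; `b` must be 16-byte aligned. */
void interleave_low_bytes(const uint8_t a[16], const uint8_t *b, uint8_t out[16])
{
    assert(((uintptr_t)b & 15u) == 0);  /* precondition of the aligned load */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i r  = unpack_lo_from_mem(va, (const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, r);
}
```

To see what your own compiler does with `unpack_lo_from_mem`, generate assembly with something like `gcc -Os -fomit-frame-pointer -S` and check whether the load shows up as a separate instruction or as the memory operand of `punpcklbw`. With optimization enabled it is typically the latter, but that is the compiler's decision, not something the intrinsic guarantees.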