SSE 4.2 performs comparisons on two 16-byte operands at a time. But it is also possible to compare two 8-byte operands at a time with ordinary processor instructions.
The difference does not seem large enough to justify a special hardware implementation of such a comparison. Is SSE 4.2 really that irrelevant, or have I missed something?
I’m not sure exactly how the standard register comparison instructions perform relative to their wider SSE equivalents (the standard comparison instruction might even require more cycles), but a 2x improvement in throughput is nothing to sneeze at.
I think you’re asking “why even have SSE 4.2 if all you get is 2 comparisons at once instead of 1?” If so, you’re overlooking a few things:
As I noted before, twice the width on an operation is nice to have. If you’re working on an application that does a lot of these comparisons, you’re probably happy that it’s there.
It’s likely that the incremental cost of adding this instruction to the existing SSE execution units was relatively small. A lot of hardware is already in place to perform the wide range of operations defined by the earlier SSE instruction sets.
Nowadays, the instructions that seem to get added are either wider versions of older capabilities (e.g. many of the AVX instructions) or operations that are important for certain specific applications (e.g. the CRC/AES instructions, 4-element dot products). It’s possible that there is some application that benefits a lot from such a comparison instruction and the cost of adding it was worth the marketing benefit achieved by being faster on those types of code.
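To make the 8-byte versus 16-byte comparison concrete, here is a rough sketch in C. The helper names `equal8` and `first_mismatch16` are made up for illustration, and I'm assuming a compiler that provides `<nmmintrin.h>` and defines `__SSE4_2__` when SSE 4.2 is enabled (e.g. GCC or Clang with `-msse4.2`); otherwise a plain byte loop is used. It says nothing about actual cycle counts, it only shows that the SSE 4.2 string-compare instruction handles 16 bytes per operation and also reports where the first difference is, which a single 8-byte register compare does not.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE4_2__)
#include <nmmintrin.h>  /* SSE 4.2 intrinsics (assumed available) */
#endif

/* Scalar path: compares 8 bytes at a time in a general-purpose register.
 * Returns 1 if the 8 bytes match, 0 otherwise. */
static int equal8(const void *a, const void *b)
{
    uint64_t x, y;
    memcpy(&x, a, sizeof x);  /* memcpy sidesteps alignment/aliasing issues */
    memcpy(&y, b, sizeof y);
    return x == y;
}

/* Compares 16 bytes at once. Returns the index (0..15) of the first
 * differing byte, or 16 if all 16 bytes are equal. */
static int first_mismatch16(const void *a, const void *b)
{
#if defined(__SSE4_2__)
    __m128i x = _mm_loadu_si128((const __m128i *)a);
    __m128i y = _mm_loadu_si128((const __m128i *)b);
    /* Byte-wise equality, inverted, so the lowest set bit marks the
     * first mismatch; with no mismatch the result is 16. */
    return _mm_cmpestri(x, 16, y, 16,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                        _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT);
#else
    /* Fallback for builds without SSE 4.2: plain byte loop. */
    const unsigned char *p = a, *q = b;
    int i;
    for (i = 0; i < 16; i++)
        if (p[i] != q[i])
            return i;
    return 16;
#endif
}
```

For example, calling `first_mismatch16("abcdefghijklmnop", "abcdefghijkXmnop")` should give 11, while `equal8` on the first 8 bytes of the same two strings reports them as equal and needs a second call to cover the rest.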