I need to scan a buffer of bytes for a 4-byte value in which all 4 bytes are the same. The length of the data is variable and the marker can be anywhere inside the data; I’m looking for the first instance. I’m trying to find the fastest possible implementation because this logic runs in a critical part of my code.
This will only ever run on x86 & x64, under Windows.
typedef unsigned char Byte;
typedef Byte* BytePtr;
typedef unsigned int UInt32;
typedef UInt32* UInt32Ptr;
const Byte MARKER_BYTE = 0xAA;
const UInt32 MARKER = 0xAAAAAAAA;
UInt32 nDataLength = ...;
BytePtr pData = ...;
BytePtr pEnd = pData + nDataLength - sizeof ( UInt32 ) + 1; // one past the last position where a 4-byte marker can start
// Option 1 -------------------------------------------
while ( pData < pEnd )
{
    if ( *( (UInt32Ptr) pData ) == MARKER )
    {
        ... // Do something here
        break;
    }
    pData++;
}
// Option 2 -------------------------------------------
while ( pData < pEnd )
{
    if ( ( *pData == MARKER_BYTE ) && ( *( (UInt32Ptr) pData ) == MARKER ) )
    {
        ... // Do something here
        break;
    }
    pData++;
}
I think Option 2 is faster but I’m not sure if my reasoning is correct.
Option 1 first reads 4 bytes from memory and compares them against the 4-byte constant; if they do not match, it steps to the next byte and starts over. The next 4-byte read from memory overlaps 3 bytes that were already read, so the same bytes need to be fetched again. Most bytes before my 4-byte marker would be read more than once (up to four times).
Option 2 only reads 1 byte at a time and if that single byte is a match, it reads the full 4-byte value from that address. This way, all bytes are read only once and only the 4 matching bytes are read twice.
Is my reasoning correct or am I overlooking something?
And before someone brings it up, yes, I really do need to perform this kind of optimization. 🙂
Edit: Note that this code will only ever run on Intel / AMD based computers. I don’t care if other architectures would fail to run this, as long as normal x86 / x64 computers (desktops / servers) run this without problems or performance penalties.
Edit 2: compiler is VC++ 2008, if that helps.
You might also try the Boyer-Moore approach.
If you’re lucky, that tests only every fourth byte until you find a match.
Whether that’s faster or slower than testing every single byte is anybody’s guess for a pattern as short as this.
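A minimal sketch of that idea, specialised to a marker made of four identical bytes. It is not a full Boyer-Moore implementation: it only uses the observation that if the byte at the end of the current 4-byte window is not the marker byte, no match can overlap that byte, so the search can jump ahead by four. `FindMarker` and `MARKER_LENGTH` are made-up names for illustration, and whether this beats the plain loops is something you would have to measure on your own data.

```cpp
#include <cstddef>

typedef unsigned char Byte;

const Byte MARKER_BYTE = 0xAA;
const std::size_t MARKER_LENGTH = 4;

// Hypothetical helper: returns a pointer to the first run of four
// MARKER_BYTE values in [pData, pData + nDataLength), or NULL if none.
const Byte* FindMarker(const Byte* pData, std::size_t nDataLength)
{
    if (nDataLength < MARKER_LENGTH)
        return NULL;

    // i is the index of the LAST byte of the window currently being tested.
    std::size_t i = MARKER_LENGTH - 1;
    while (i < nDataLength)
    {
        if (pData[i] != MARKER_BYTE)
        {
            // Every window that contains byte i is ruled out,
            // so skip a whole marker length.
            i += MARKER_LENGTH;
            continue;
        }

        // Byte i matches: walk back (at most 3 bytes) to the start of the run.
        std::size_t start = i;
        while (start > i - (MARKER_LENGTH - 1) && pData[start - 1] == MARKER_BYTE)
            --start;

        // Walk forward until the run is 4 bytes long, breaks, or the data ends.
        std::size_t end = i + 1;
        while (end < nDataLength && end - start < MARKER_LENGTH
               && pData[end] == MARKER_BYTE)
            ++end;

        if (end - start == MARKER_LENGTH)
            return pData + start;

        // The run was too short: either byte 'end' is not a marker byte or
        // the data ended. No match can overlap 'end', so jump past it.
        i = end + MARKER_LENGTH;
    }
    return NULL;
}
```

In the best case (no marker bytes near the probed positions) this touches only every fourth byte; in the worst case it degrades to roughly one comparison per byte, like the byte-wise loop in Option 2.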