I’m going to have to code a very basic checksum function, something like:
    char sum(const char * data, const int len)
    {
        char sum(0);
        for (const char * end=data+len ; data<end ; ++data)
            sum += *data;
        return sum;
    }
That’s trivial. Now, how should I optimize this?
First, I should probably use some std::for_each with a lambda or something like that:
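    #include <algorithm> // std::for_each

    char sum2(const char * data, const int len)
    {
        char sum(0);
        // Accumulate into the captured local; wraps modulo 256 like the loop version.
        std::for_each(data, data+len, [&sum](char b){sum+=b;});
        return sum;
    }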
Next, I could use multiple threads/cores to sum up chunks, then add the results. I won't write it down, because I'm afraid the cost of creating threads (or even getting them from a pool), cutting up the array, and dispatching the work would outweigh the gain: I would mostly be computing checksums for small arrays, typically 10-100 bytes and rarely up to 1000.
But what I really want is something lower level: some SIMD instructions that would sum bytes in 128-bit registers, or add bytes independently between two registers without propagating the carry from one byte to the next, or both.
Is there any such thing out there?
Note: This IS actual premature optimization, but it’s fun, so what the hell?
Edit: I still need a way to sum up all the bytes in an SSE register, something better than:
    char ptr[16];
    _mm_storeu_si128((__m128i*)ptr, sum);
    checksum += ptr[0] + ptr[1] + ptr[2] + ptr[3] + ptr[4] + ptr[5] + ptr[6] + ptr[7]
              + ptr[8] + ptr[9] + ptr[10] + ptr[11] + ptr[12] + ptr[13] + ptr[14] + ptr[15];
Yes, there are such instructions in the MMX instruction set, called "Packed ADD":
- _mm_add_pi8 in Visual C++
- __builtin_ia32_paddb in gcc
And in the SSE2 instruction set:
- _mm_add_epi8 in Visual C++
- __builtin_ia32_paddb128 in gcc
EDIT: A faster way to add the partial sums:
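One possible way to do this, sketched under assumptions rather than taken from the answer itself: accumulate 16 bytes at a time with _mm_add_epi8, then collapse the 16 byte lanes with _mm_sad_epu8 (the SSE2 "sum of absolute differences" instruction, used here against an all-zero register, which is a technique swapped in for illustration). The names hsum_bytes and sum_sse2 are hypothetical. The intrinsics come from <emmintrin.h>, which GCC and Clang provide as well as Visual C++. The result is the byte sum modulo 256, so it should match the scalar loop from the question.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds the 16 byte lanes of v together, modulo 256.
inline std::uint8_t hsum_bytes(__m128i v)
{
    // Against zero, _mm_sad_epu8 produces two 16-bit sums (bytes 0-7 and
    // bytes 8-15), one in the low 16 bits of each 64-bit half.
    const __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
    const int lo = _mm_cvtsi128_si32(sad);    // sum of bytes 0-7
    const int hi = _mm_extract_epi16(sad, 4); // sum of bytes 8-15
    return static_cast<std::uint8_t>(lo + hi);
}

// Hypothetical name: SSE2 variant of the scalar checksum.
inline char sum_sse2(const char* data, std::size_t len)
{
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    // Main loop: 16 independent byte additions per iteration; each lane
    // wraps on overflow and no carry crosses into the neighbouring lane.
    for (; i + 16 <= len; i += 16) {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_add_epi8(acc, chunk);
    }
    std::uint8_t total = hsum_bytes(acc);
    // Scalar tail for the remaining 0-15 bytes.
    for (; i < len; ++i)
        total = static_cast<std::uint8_t>(total + static_cast<std::uint8_t>(data[i]));
    return static_cast<char>(total);
}
```

For the 10-100 byte inputs mentioned in the question, the main loop runs only a handful of times, so whether this actually beats the plain loop (which compilers may auto-vectorize anyway) is something to measure rather than assume.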