My SIMD implementation takes a varying amount of time, even though it is run on a fixed input. The running time varies between roughly 100 million and 120 million clock cycles. The program calls one function around 600 times, and the most expensive part of that function accesses memory ~2000 times. Overall, then, my program is quite memory-intensive.
Is the variation in running time due to memory access patterns/initial memory contents?
I used valgrind to profile my program. It shows that each memory access takes about 8 instructions. Is this normal?
The following is the body of the function that is called 600 times. mulprev[32][20] is the array that is accessed most often.
    j = 15;
    u3v = _mm_set_epi64x (0xF, 0xF);
    while (j + 1)
    {
        l = j << 2;
        for (i = 0; i < 20; i++)
        {
            val1v = _mm_load_si128 ((__m128i *) &elm1v[i]);
            uv = _mm_and_si128 (_mm_srli_epi64 (val1v, l), u3v);
            u1 = _mm_extract_epi16 (uv, 0);
            u2 = _mm_extract_epi16 (uv, 4) + 16;
            for (ival = i, ival1 = i + 1, k = 0; k < 20; k += 2, ival += 2, ival1 += 2)
            {
                temp11v = _mm_load_si128 ((__m128i *) &mulprev[u1][k]);
                temp12v = _mm_load_si128 ((__m128i *) &mulprev[u2][k]);
                val1v = _mm_load_si128 ((__m128i *) &res[ival]);
                val2v = _mm_load_si128 ((__m128i *) &res[ival1]);
                bv = _mm_xor_si128 (val1v, _mm_unpacklo_epi64 (temp11v, temp12v));
                av = _mm_xor_si128 (val2v, _mm_unpackhi_epi64 (temp11v, temp12v));
                _mm_store_si128 ((__m128i *) &res[ival], bv);
                _mm_store_si128 ((__m128i *) &res[ival1], av);
            }
        }
        if (j == 0)
            break;
        val0v = _mm_setzero_si128 ();
        for (i = 0; i < 40; i++)
        {
            testv = _mm_load_si128 ((__m128i *) &res[i]);
            val1v = _mm_srli_epi64 (testv, 60);
            val2v = _mm_xor_si128 (val0v, _mm_slli_epi64 (testv, 4));
            _mm_store_si128 (&res[i], val2v);
            val0v = val1v;
        }
        j--;
    }
I want to reduce the computation time of my program. Any suggestions?
You are performing almost no computation between loads and stores, so your execution time will most likely be dominated by the cost of moving data to and from cache/memory. Even worse, your data set appears to be relatively small. Probably the only way you can optimise this further is to improve the memory access pattern (make accesses sequential where possible, ensure that cache lines are not wasted, etc.) and/or combine these operations with other code which operates on the same data set before/after this routine (so that the cost of the loads/stores is amortised somewhat).
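To make the "cache lines are not wasted" point a little more concrete, here is a minimal sketch of one possible table layout. It assumes 64-bit table elements and a 64-byte cache line, neither of which is stated in the question; `padded_row`, `mulprev_padded` and `lines_touched` are hypothetical names used only for illustration. Each row is padded from 160 to 192 bytes and aligned to 64 bytes, so reading a row always touches exactly three cache lines, whereas an unpadded row can straddle four depending on where it starts.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define ROW_WORDS  20   /* words actually used per row, as in mulprev[32][20] */
#define ROW_PADDED 24   /* 24 * 8 = 192 bytes = 3 whole cache lines */

/* Hypothetical padded replacement for one row of mulprev
   (64-bit elements are an assumption, not taken from the question). */
typedef struct {
    alignas(CACHE_LINE) uint64_t w[ROW_PADDED];
} padded_row;

/* Every row starts on a cache-line boundary. */
static padded_row mulprev_padded[32];

/* Number of cache lines covered by `bytes` bytes starting at address `addr`. */
static size_t lines_touched(uintptr_t addr, size_t bytes)
{
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last  = (addr + bytes - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}
```

The padding costs 1 KB extra for the whole table (6 KB instead of 5 KB). Whether that is a net win would have to be measured, since a larger table also occupies more of the cache.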
EDIT: note that I gave a very similar answer when you asked much the same question about an apparently earlier version of this routine ("How to make the following code faster"). You seem to have missed the point that your main performance problem here is memory access, not computation.
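As an illustration of amortising loads and stores by combining passes over the same data, here is a simplified scalar sketch. It is not a drop-in replacement for the routine in the question: the function names and the plain `uint64_t` arrays are hypothetical, and it only shows the general idea. The two-pass version loads and stores every word of `res` twice (once for the XOR, once for the 4-bit shift with carry), while the fused version does both steps with a single load and store per word.

```c
#include <stddef.h>
#include <stdint.h>

/* Two passes: every word of res is loaded and stored twice. */
static void xor_then_shift_two_pass(uint64_t *res, const uint64_t *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i] ^= tab[i];

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t w = res[i];
        res[i] = (w << 4) ^ carry;   /* shift left by 4, bring in previous carry */
        carry = w >> 60;             /* top 4 bits move into the next word */
    }
}

/* Fused: the same result with one load and one store of res per word. */
static void xor_then_shift_fused(uint64_t *res, const uint64_t *tab, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t w = res[i] ^ tab[i];
        res[i] = (w << 4) ^ carry;
        carry = w >> 60;
    }
}
```

In the routine from the question, the accumulation loop and the shift loop both walk over `res`, so some form of this fusion may be applicable there, but the details depend on how the passes overlap and would need checking against the real data layout.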