I’m writing a C# class that performs 2D separable convolution using integers, hoping for better performance than the double counterpart. The problem is that I don’t see a real performance gain.
This is the X filter code (it is the same for both the int and double cases):
foreach (pixel)
{
    int value = 0; // double in the double case
    for (int k = 0; k < filterOffsetsX.Length; k++)
    {
        // index is relative to the current pixel position
        value += InputImage[index + filterOffsetsX[k]] * filterValuesX[k];
    }
    tempImage[index] = value;
}
In the integer case “value”, “InputImage” and “tempImage” are of “int”, “Image<byte>” and “Image<int>” types.
In the double case “value”, “InputImage” and “tempImage” are of “double”, “Image<double>” and “Image<double>” types.
(filterValuesX is int[] in both cases.)
(The class Image<T> is part of an external DLL. It should be similar to the .NET Drawing Image class.)
My goal is to get faster performance from int += (byte * int) compared with double += (double * int).
The following times are the mean of 200 repetitions.
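To make the comparison concrete, here is a minimal sketch of the two inner loops on plain arrays. It assumes a single image row and uses byte[], int[] and double[] as stand-ins for Image<T>, whose real API is not shown here. The names RowFilterSketch, FilterRowInt, FilterRowDouble and radius are hypothetical and exist only for illustration.

```csharp
using System;

static class RowFilterSketch
{
    // Integer path: byte pixels, int taps, int accumulator.
    // radius is assumed to be the largest absolute offset, so border pixels are skipped.
    public static void FilterRowInt(byte[] input, int[] output, int[] offsets, int[] taps, int radius)
    {
        for (int index = radius; index < input.Length - radius; index++)
        {
            int value = 0;
            for (int k = 0; k < offsets.Length; k++)
            {
                // byte is promoted to int before the multiply
                value += input[index + offsets[k]] * taps[k];
            }
            output[index] = value;
        }
    }

    // Double path: double pixels, int taps, double accumulator.
    public static void FilterRowDouble(double[] input, double[] output, int[] offsets, int[] taps, int radius)
    {
        for (int index = radius; index < input.Length - radius; index++)
        {
            double value = 0.0;
            for (int k = 0; k < offsets.Length; k++)
            {
                // the int tap is converted to double before the multiply
                value += input[index + offsets[k]] * taps[k];
            }
            output[index] = value;
        }
    }
}
```

Both versions do the same number of loads, index additions and bounds checks per tap. Only the multiply-add itself changes type, which may be part of why the measured gap is small.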
Filter size 9 = 0.031 (double) 0.027 (int)
Filter size 13 = 0.042 (double) 0.038 (int)
Filter size 25 = 0.078 (double) 0.070 (int)
The performance gain is minimal. Could this be caused by pipeline stalls and suboptimal code?
EDIT: simplified the code by deleting unimportant variables.
EDIT2: I don’t think I have a cache-miss problem, because “index” iterates through adjacent memory cells (row after row). Moreover, “filterOffsetsX” contains only small offsets relative to pixels on the same row, at a maximum distance of filter size / 2. The problem may be present in the second separable filter (the Y filter), but the times are not so different.
It seems like you are saying you are only running that inner loop 5000 times, even in your longest case. The FPU, last I checked (admittedly a long time ago), only took about 5 more cycles to perform a multiply than the integer unit. So by using integers you would be saving about 25,000 CPU cycles. That’s assuming no cache misses or anything else that would cause the CPU to sit and wait in either case.
Assuming a modern Intel Core CPU clocked in the neighborhood of 2.5 GHz, you could expect to have saved about 10 microseconds of runtime by using the integer unit. Kinda paltry. I do realtime programming for a living, and we wouldn’t sweat that much CPU wastage here, even if we were missing a deadline somewhere.
digEmAll makes a very good point in the comments though. If the compiler and optimizer are doing their jobs, the entire thing is pipelined. That means that in actuality the entire inner loop will take 5 cycles longer to run with the FPU than with the integer unit, not each operation in it. If that were the case, your expected time savings would be so small it would be tough to measure them.
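Spelled out, the two estimates above work out as follows. This is only a back-of-the-envelope sketch using the figures already assumed in this answer (5000 inner-loop iterations, about 5 extra cycles per floating-point multiply, a 2.5 GHz clock); none of these numbers were measured.

```latex
\begin{align*}
\text{unpipelined saving} &\approx 5000 \times 5\ \text{cycles} = 25{,}000\ \text{cycles},\\
t_{\text{unpipelined}} &\approx \frac{25{,}000\ \text{cycles}}{2.5\times 10^{9}\ \text{cycles/s}} = 10^{-5}\ \text{s} = 10\ \mu\text{s},\\
\text{pipelined saving} &\approx 5\ \text{cycles},\\
t_{\text{pipelined}} &\approx \frac{5\ \text{cycles}}{2.5\times 10^{9}\ \text{cycles/s}} = 2\times 10^{-9}\ \text{s} = 2\ \text{ns}.
\end{align*}
```

Either way, the multiply latency alone accounts for very little time, so most of the per-tap cost is likely elsewhere (loads, index arithmetic, bounds checks).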
If you really are doing enough floating-point ops to make the entire shebang take a very long time, I’d suggest looking into doing one or more of the following: