I’ve recently implemented a QuickSort algorithm in C#. Sorting on an integer array containing millions of items, the code’s performance is approximately 10% behind .NET’s implementation.
private static void QS(int[] arr, int left, int right)
{
if (left >= right) return;
var pIndex = Partition(arr, left, right);
QS( arr, left, pIndex);
QS( arr, pIndex + 1, right);
}
On an array of 5 million items, this code is about 60ms slower than .NET’s.
Subsequently, I created an another method that has the Partition() method inlined into QS() (eliminating the method call and the return statement). This however resulted in a drop in perfomance to about 250ms slower than .NET’s sort method.
Why does this happen?
Edit: This is the code the Partition() method. In the inlined version of QS(), the entire content of this method with the exception of the return statement replaces the var pIndex = Partition(arr, left, right); line.
private static int Partition(int[] arr, int left, int right)
{
int pivot = arr[left];
int leftPoint = left - 1;
int pIndex = right + 1;
int temp = 0;
while (true)
{
do { pIndex--; } while (arr[pIndex] > pivot);
do { leftPoint++; } while (arr[leftPoint] < pivot);
if (leftPoint < pIndex)
{
temp = arr[leftPoint];
arr[leftPoint] = arr[pIndex];
arr[pIndex] = temp;
}
else { break; }
}
return pIndex;
}
Edit #2:
In case anyone’s interested in compiling, here’s the code that calls the algorithms:
Edit #3:
New test code from Haymo’s suggestion.
private static void Main(string[] args)
{
const int globalRuns = 10;
const int localRuns = 1000;
var source = Enumerable.Range(1, 200000).OrderBy(n => Guid.NewGuid()).ToArray();
var a = new int[source.Length];
int start, end, total;
for (int z = 0; z < globalRuns; z++)
{
Console.WriteLine("Run #{0}", z+1);
total = 0;
for (int i = 0; i < localRuns; i++)
{
Array.Copy(source, a, source.Length);
start = Environment.TickCount;
Array.Sort(a);
end = Environment.TickCount;
total += end - start;
}
Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", ".NET", total, total / localRuns);
total = 0;
for (int i = 0; i < localRuns; i++)
{
Array.Copy(source, a, source.Length);
start = Environment.TickCount;
Quicksort.SortInline(a);
end = Environment.TickCount;
total += end - start;
}
Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", "Inlined", total, total / localRuns);
total = 0;
for (int i = 0; i < localRuns; i++)
{
Array.Copy(source, a, source.Length);
start = Environment.TickCount;
Quicksort.SortNonInline(a);
end = Environment.TickCount;
total += end - start;
}
Console.WriteLine("{0}\tTtl: {1}ms\tAvg: {2}ms\n", "Not inlined", total, total / localRuns);
}
}
Based on the information given in the question, one can only guess and name some ideas.
Did you measure correctly? Keep in mind, in order to get reliable performance results, one should (at least)
In order to make sure, the source of the (assumed) performance drop is really connected to function inlining one can investigate the generated IL code. Or even better: the machine instructions generated by the JIT compiler.
For ILNumerics we implemented a custom quick sort and did a lot of performance measures. The final algorithm is several times faster then the CLR version. Manual inlining is only one improvement, which was neccessary for better performance. Others are:
Very often, the source for strange performance results is found in the way, memory is (mis)used by an algorithm. Another might lay in the different instruction flow which eventually more or less succeed in getting optimized by any compiler/processor involved. The overall execution performance is a highly complex beast, hard to guess deterministically and hence a profiler is your best friend!
@Edit: By looking at your main test routine, it appears, you are mainly measuring the memory bandwidth of your processor/main memory. A int[] array of length 5*10e6 is roughly 19 MB in size. This most probably is out of range of your caches. So the processor will wait for memory due to compulsory cache misses most the time. This makes a measure of the influence of any code reformulation hard to guess. I suggest to try to measure this way instead: (pseudo code)
iterate over number of global repetitions (let say 10)
average time for Array.Sort
inner repetitions for Quicksort.Sort (eg. 1000)
average time for Quicksort.Sort
inner repetitions for Quicksort.Sort2 (eg. 1000)
The goal is to make the quicksort only use data from the caches. Therefore, make sure not to recreate the copies from new memory but rather have only two global instances: the original and the copy to get sorted. Both must fit into your (last level) cache at the same time! With some headroom (for other processes on the system) a good guess is to use up only half of the available last level cache size for both arrays. Depending on your true cache size, a test length of 250k seems more reasonable.
@Edit2: I ran your code, got the same results and watched the (optimized) machine instructions in the VS debugger. Here comes the relevant part from both versions:
As can be seen, the ‘Not inlined’ version does better utilize the registers for decrementing the running index (Line 69 / 97). Obviously the JIT decided not to push and pop the corresponding register on the stack, because of other code in the same function is making use of the same register. Since this is a hot loop (and the CLR does not recognize this) the overall execution speed suffers. So, in this specific case, manually inlining the Partition function is not profitable.
However, as you know, there is no guarantee that other versions of the CLR do the same. Differences may even arise for 64 bit.