Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7620913
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 31, 20262026-05-31T04:03:54+00:00 2026-05-31T04:03:54+00:00

I’ve recently implemented a QuickSort algorithm in C#. Sorting on an integer array containing

  • 0

I’ve recently implemented a QuickSort algorithm in C#. Sorting on an integer array containing millions of items, the code’s performance is approximately 10% behind .NET’s implementation.

private static void QS(int[] arr, int left, int right)
{
    if (left >= right) return;

    var pIndex = Partition(arr, left, right);
    QS( arr, left, pIndex);
    QS( arr, pIndex + 1, right);
}

On an array of 5 million items, this code is about 60ms slower than .NET’s.

Subsequently, I created an another method that has the Partition() method inlined into QS() (eliminating the method call and the return statement). This however resulted in a drop in perfomance to about 250ms slower than .NET’s sort method.

Why does this happen?

Edit: This is the code the Partition() method. In the inlined version of QS(), the entire content of this method with the exception of the return statement replaces the var pIndex = Partition(arr, left, right); line.

private static int Partition(int[] arr, int left, int right)
{
    int pivot = arr[left];
    int leftPoint = left - 1;
    int pIndex = right + 1;
    int temp = 0;

    while (true)
    {
        do { pIndex--; } while (arr[pIndex] > pivot);
        do { leftPoint++; } while (arr[leftPoint] < pivot);

        if (leftPoint < pIndex)
        {
            temp = arr[leftPoint];
            arr[leftPoint] = arr[pIndex];
            arr[pIndex] = temp;
        }
        else { break; }
    }
    return pIndex;
}

Edit #2:
In case anyone’s interested in compiling, here’s the code that calls the algorithms:

Edit #3:
New test code from Haymo’s suggestion.

private static void Main(string[] args)
{
    const int globalRuns = 10;
    const int localRuns = 1000;

    var source = Enumerable.Range(1, 200000).OrderBy(n => Guid.NewGuid()).ToArray();
    var a = new int[source.Length];

    int start, end, total;

    for (int z = 0; z < globalRuns; z++)
    {
        Console.WriteLine("Run #{0}", z+1);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Array.Sort(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", ".NET", total, total / localRuns);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Quicksort.SortInline(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", "Inlined", total, total / localRuns);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Quicksort.SortNonInline(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\tTtl: {1}ms\tAvg: {2}ms\n", "Not inlined", total, total / localRuns);
    }
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-31T04:03:56+00:00Added an answer on May 31, 2026 at 4:03 am

    Based on the information given in the question, one can only guess and name some ideas.

    Did you measure correctly? Keep in mind, in order to get reliable performance results, one should (at least)

    • Run release builds without debugger attached
    • Repeat the test sufficiently often and average the results
    • Make the test fair, ie. give all test subjects the same ‘ressource configuration’

    In order to make sure, the source of the (assumed) performance drop is really connected to function inlining one can investigate the generated IL code. Or even better: the machine instructions generated by the JIT compiler.

    For ILNumerics we implemented a custom quick sort and did a lot of performance measures. The final algorithm is several times faster then the CLR version. Manual inlining is only one improvement, which was neccessary for better performance. Others are:

    • Not using recursion
    • Using unsafe comparison / swapping
    • Using insertion sort for smaller partitions
    • Optimzing for limited size of the temporary (stack replacement) array

    Very often, the source for strange performance results is found in the way, memory is (mis)used by an algorithm. Another might lay in the different instruction flow which eventually more or less succeed in getting optimized by any compiler/processor involved. The overall execution performance is a highly complex beast, hard to guess deterministically and hence a profiler is your best friend!

    @Edit: By looking at your main test routine, it appears, you are mainly measuring the memory bandwidth of your processor/main memory. A int[] array of length 5*10e6 is roughly 19 MB in size. This most probably is out of range of your caches. So the processor will wait for memory due to compulsory cache misses most the time. This makes a measure of the influence of any code reformulation hard to guess. I suggest to try to measure this way instead: (pseudo code)

    • generate test data
    • allocate array for copies
    • iterate over number of global repetitions (let say 10)

      • inner repetitions for Array.Sort (eg. 1000)
        • copy the (unsorted) test data
        • sort the copy by Array.Sort
        • add time
      • average time for Array.Sort

      • inner repetitions for Quicksort.Sort (eg. 1000)

        • copy the (unsorted) test data
        • sort the copy by Quicksort.Sort
        • add time
      • average time for Quicksort.Sort

      • inner repetitions for Quicksort.Sort2 (eg. 1000)

        • copy the (unsorted) test data
        • sort the copy by Quicksort.Sort2
        • add time
      • average time for Quicksort.Sort2

    The goal is to make the quicksort only use data from the caches. Therefore, make sure not to recreate the copies from new memory but rather have only two global instances: the original and the copy to get sorted. Both must fit into your (last level) cache at the same time! With some headroom (for other processes on the system) a good guess is to use up only half of the available last level cache size for both arrays. Depending on your true cache size, a test length of 250k seems more reasonable.

    @Edit2: I ran your code, got the same results and watched the (optimized) machine instructions in the VS debugger. Here comes the relevant part from both versions:

    Not inlined: 
        69:                 do { pIndex--; } while (arr[pIndex] > pivot);
    00000017  dec         ebx 
    00000018  cmp         ebx,esi 
    0000001a  jae         00000053 
    0000001c  cmp         dword ptr [ecx+ebx*4+8],edi 
    00000020  jg          00000017 
        70:                 do { leftPoint++; } while (arr[leftPoint] < pivot);
    00000022  inc         edx 
    00000023  cmp         edx,esi 
    00000025  jae         00000053 
    00000027  cmp         dword ptr [ecx+edx*4+8],edi 
    0000002b  jl          00000022 
    
    
    Inlined: 
        97:                 do { pIndex--; } while (arr[pIndex] > pivot);
    00000038  dec         dword ptr [ebp-14h] 
    0000003b  mov         eax,dword ptr [ebp-14h] 
    0000003e  cmp         eax,edi 
    00000040  jae         00000097 
    00000042  cmp         dword ptr [esi+eax*4+8],ebx 
    00000046  jg          00000038 
        98:                 do { leftPoint++; } while (arr[leftPoint] < pivot);
    00000048  inc         ecx 
    00000049  cmp         ecx,edi 
    0000004b  jae         00000097 
    0000004d  cmp         dword ptr [esi+ecx*4+8],ebx 
    00000051  jl          00000048 
    

    As can be seen, the ‘Not inlined’ version does better utilize the registers for decrementing the running index (Line 69 / 97). Obviously the JIT decided not to push and pop the corresponding register on the stack, because of other code in the same function is making use of the same register. Since this is a hot loop (and the CLR does not recognize this) the overall execution speed suffers. So, in this specific case, manually inlining the Partition function is not profitable.

    However, as you know, there is no guarantee that other versions of the CLR do the same. Differences may even arise for 64 bit.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

link Im having trouble converting the html entites into html characters, (&# 8217;) i
I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
I have this code to decode numeric html entities to the UTF8 equivalent character.
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
I have this code: - (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock { NSString *someString = [[NSString
I ran into a problem. Wrote the following code snippet: teksti = teksti.Trim() teksti
That's pretty much it. I'm using Nokogiri to scrape a web page what has
I have just tried to save a simple *.rtf file with some websites and
I want to count how many characters a certain string has in PHP, but
I would like to count the length of a string with PHP. The string

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.