Why does the order in which C# methods in .NET 4.0 are just-in-time compiled affect how quickly they execute? For example, consider two equivalent methods:
public static void SingleLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        count += i % 16 == 0 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}

public static void MultiLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Multi-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}
The only difference is the introduction of a local variable, which affects the assembly code generated and the loop performance. Why that is the case is a question in its own right.
Possibly even stranger is that on x86 (but not x64), the order in which the methods are invoked has around a 20% impact on performance. Invoke the methods like this…
static void Main()
{
    SingleLineTest();
    MultiLineTest();
}
…and SingleLineTest is faster. (Compile using the x86 Release configuration, ensuring that the “Optimize code” setting is enabled, and run the test from outside VS2010.) But reverse the order…
static void Main()
{
    MultiLineTest();
    SingleLineTest();
}
…and both methods take the same time (almost, but not quite, as long as MultiLineTest took before). (When running this test, it’s useful to add some additional calls to SingleLineTest and MultiLineTest to get additional samples. How many calls and in what order doesn’t matter; only which method is called first does.)
Finally, to demonstrate that JIT order is important, leave MultiLineTest first, but force SingleLineTest to be JITed first…
static void Main()
{
    RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle);
    MultiLineTest();
    SingleLineTest();
}
Now, SingleLineTest is faster again.
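To try several JIT orderings without editing the PrepareMethod line each time, the call can be wrapped in a small helper. This is only a minimal sketch: JitOrder and Prepare are hypothetical names that are not part of the sample project, and the helper assumes the named methods are public, static, and not overloaded (an overloaded name would make GetMethod throw AmbiguousMatchException).

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

public static class JitOrder
{
    // Hypothetical helper: asks the runtime to JIT-compile the named
    // public static methods of 'type', in the order given, before any
    // of them has been called.
    public static void Prepare(Type type, params string[] methodNames)
    {
        foreach (string name in methodNames)
        {
            MethodInfo method = type.GetMethod(
                name, BindingFlags.Public | BindingFlags.Static);
            if (method == null)
            {
                throw new ArgumentException(
                    "No public static method named " + name);
            }
            RuntimeHelpers.PrepareMethod(method.MethodHandle);
        }
    }
}
```

With this in place, the first line of Main could read JitOrder.Prepare(typeof(Program), "SingleLineTest", "MultiLineTest"); and swapping the two names would reverse the JIT order while leaving the call order untouched.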
If you turn off “Suppress JIT optimization on module load” in VS2010, you can put a breakpoint in SingleLineTest and see that the assembly code in the loop is the same regardless of JIT order; however, the assembly code at the beginning of the method varies. But how this matters when the bulk of the time is spent in the loop is perplexing.
A sample project demonstrating this behavior is on GitHub.
It’s not clear how this behavior affects real-world applications. One concern is that it can make performance tuning volatile, depending on the order methods happen to be first called. Problems of this sort would be difficult to detect with a profiler. Once you found the hotspots and optimized their algorithms, it would be hard to know without a lot of guess and check whether additional speedup is possible by JITing methods early.
Update: See also the Microsoft Connect entry for this issue.
Please note that I do not trust the “Suppress JIT optimization on module load” option; instead, I spawn the process without debugging and attach my debugger after the JIT has run.
In the version where single-line runs faster, this is Main:
Note that MultiLineTest has been placed on an 8-byte boundary, and SingleLineTest on a 4-byte boundary.
Here’s Main for the version where both run at the same speed:
Amazingly, the addresses chosen by the JIT are identical in the last 4 digits, even though it allegedly processed them in the opposite order. Not sure I believe that any more.
More digging is necessary. I think it was mentioned that the code before the loop wasn’t exactly the same in both versions? Going to investigate.
Here’s the “slow” version of SingleLineTest (and I checked, the last digits of the function address haven’t changed).
And the “fast” version:
Just the loops, fast on the left, slow on the right:
The instructions are identical (being relative jumps, the machine code is identical even though the disassembly shows different addresses), but the alignment is different. There are three jumps. The je loading a constant 1 is aligned in the slow version and not in the fast version, but it hardly matters, since that jump is only taken 1/16 of the time. The other two jumps (jmp after loading a constant zero, and jb repeating the entire loop) are taken millions more times, and are aligned in the “fast” version.
I think this is the smoking gun.
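When comparing listings like these, it can help to compute the alignment of each function or jump-target address rather than eyeballing the last hex digit. The following is a rough sketch; Alignment and LargestAlignment are hypothetical names, and the addresses in the usage note are made up for illustration rather than taken from the disassembly.

```csharp
using System;

public static class Alignment
{
    // Returns the largest power-of-two boundary, capped at maxBoundary,
    // that 'address' is a multiple of. maxBoundary is assumed to be a
    // small power of two such as 16 or 64.
    public static uint LargestAlignment(ulong address, uint maxBoundary)
    {
        uint boundary = 1;
        while (boundary < maxBoundary && address % (boundary * 2) == 0)
        {
            boundary *= 2;
        }
        return boundary;
    }
}
```

For example, Alignment.LargestAlignment(0x2A8, 16) returns 8 and Alignment.LargestAlignment(0x2A4, 16) returns 4, which is the same kind of 8-byte versus 4-byte difference described above.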