I’m writing different implementations of immutable binary trees in C#, and I wanted my trees to inherit some common methods from a base class.
Unfortunately, classes which derive from the base class are abysmally slow. Non-derived classes perform adequately. Here are two nearly identical implementations of an AVL tree to demonstrate:
- AvlTree: http://pastebin.com/V4WWUAyT
- DerivedAvlTree: http://pastebin.com/PussQDmN
The two trees have the exact same code, except that I’ve moved the DerivedAvlTree.Insert method into the base class. Here’s a test app:
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
    using Juliet.Collections.Immutable;

    namespace ConsoleApplication1
    {
        class Program
        {
            const int VALUE_COUNT = 5000;

            static void Main(string[] args)
            {
                var avlTreeTimes = TimeIt(TestAvlTree);
                var derivedAvlTreeTimes = TimeIt(TestDerivedAvlTree);
                Console.WriteLine("avlTreeTimes: {0}, derivedAvlTreeTimes: {1}", avlTreeTimes, derivedAvlTreeTimes);
            }

            // Runs f once per seed and returns the average elapsed time in milliseconds.
            static double TimeIt(Func<int, int> f)
            {
                var seeds = new int[] { 314159265, 271828183, 231406926, 141421356, 161803399, 266514414, 15485867, 122949829, 198491329, 42 };
                var times = new List<double>();
                foreach (int seed in seeds)
                {
                    var sw = Stopwatch.StartNew();
                    f(seed);
                    sw.Stop();
                    times.Add(sw.Elapsed.TotalMilliseconds);
                }

                // throwing away top and bottom results
                times.Sort();
                times.RemoveAt(0);
                times.RemoveAt(times.Count - 1);
                return times.Average();
            }

            // Inserts VALUE_COUNT random doubles into the non-derived tree.
            static int TestAvlTree(int seed)
            {
                var rnd = new System.Random(seed);
                var avlTree = AvlTree<double>.Create((x, y) => x.CompareTo(y));
                for (int i = 0; i < VALUE_COUNT; i++)
                {
                    avlTree = avlTree.Insert(rnd.NextDouble());
                }
                return avlTree.Count;
            }

            // Same workload, but against the tree that inherits Insert from the base class.
            static int TestDerivedAvlTree(int seed)
            {
                var rnd = new System.Random(seed);
                var avlTree2 = DerivedAvlTree<double>.Create((x, y) => x.CompareTo(y));
                for (int i = 0; i < VALUE_COUNT; i++)
                {
                    avlTree2 = avlTree2.Insert(rnd.NextDouble());
                }
                return avlTree2.Count;
            }
        }
    }
- AvlTree: inserts 5000 items in 121 ms
- DerivedAvlTree: inserts 5000 items in 2182 ms
My profiler indicates that the program spends an inordinate amount of time in BaseBinaryTree.Insert. Anyone who’s interested can see the EQATEC log file I’ve created with the code above (you’ll need the EQATEC profiler to make sense of the file).
I really want to use a common base class for all of my binary trees, but I can’t do that if performance will suffer.
What causes my DerivedAvlTree to perform so badly, and what can I do to fix it?
Note – there’s now a “clean” solution here, so skip to the final edit if you only want a version that runs fast and don’t care about all of the detective work.
It doesn’t seem to be the difference between direct and virtual calls that’s causing the slowdown. It’s something to do with those delegates; I can’t quite explain specifically what it is, but a look at the generated IL is showing a lot of cached delegates which I think might not be getting used in the base class version. But the IL itself doesn’t seem to be significantly different between the two versions, which leads me to believe that the jitter itself is partly responsible.
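For anyone unfamiliar with what "cached delegates" means here, this is a minimal sketch, separate from the tree code in the question. The class and method names (DelegateCachingDemo, NonCapturing, Capturing) are made up for illustration. The assumption, which holds for the Microsoft C# compilers as far as I know but is an implementation detail rather than a language guarantee, is that a lambda capturing nothing is stored once in a compiler-generated static field, while a lambda capturing a local allocates a closure object and a new delegate every time it is evaluated.

```csharp
using System;

// Hypothetical demo class; none of these names come from the original trees.
public static class DelegateCachingDemo
{
    // Captures nothing: the compiler can create this delegate once
    // and reuse the same instance on later calls.
    public static Func<double, double, int> NonCapturing()
    {
        return (x, y) => x.CompareTo(y);
    }

    // Captures the parameter `pivot`: each call allocates a closure object
    // holding pivot, plus a fresh delegate pointing at it.
    public static Func<double, int> Capturing(double pivot)
    {
        return x => x.CompareTo(pivot);
    }
}
```

Comparing `ReferenceEquals(DelegateCachingDemo.NonCapturing(), DelegateCachingDemo.NonCapturing())` with `ReferenceEquals(DelegateCachingDemo.Capturing(1.0), DelegateCachingDemo.Capturing(1.0))` is a quick way to see whether a given lambda is being reused or reallocated. A capturing lambda sitting inside a recursive method pays that allocation on every level of the recursion.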
Take a look at this refactoring, which cuts the running time by about 60%:
This should (and apparently does) ensure that the insertion delegate is only being created once per insert – it’s not getting created on each recursion. On my machine it cuts the running time from 350 ms to 120 ms (by contrast, the single-class version runs in about 30 ms, so this is still nowhere near where it should be).
But here’s where it gets even weirder – after trying the above refactoring, I figured, hmm, maybe it’s still slow because I only did half the work. So I tried materializing the first delegate as well:
And guess what… this made it slower again! With this version, on my machine, it took a little over 250 ms on this run.
This defies all logical explanations that might relate the issue to the compiled bytecode, which is why I suspect that the jitter is in on this conspiracy. I think the first “optimization” above might be (WARNING – speculation ahead) allowing that insertion delegate to be inlined – it’s a known fact that the jitter can’t inline virtual calls – but there’s still something else that’s not being inlined and that’s where I’m presently stumped.
My next step would be to selectively disable inlining on certain methods via the MethodImplAttribute and see what effect that has on the runtime – that would help to prove or disprove this theory.
I know this isn’t a complete answer but hopefully it at least gives you something to work with, and maybe some further experimentation with this decomposition can produce results that are close in performance to the original version.
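The inlining experiment mentioned above could look roughly like this. MethodImplAttribute and MethodImplOptions.NoInlining are the real framework types (in System.Runtime.CompilerServices); the class and the two comparison methods are hypothetical stand-ins for whichever tree methods you suspect.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-ins; apply the attribute to the real tree methods under test.
public static class InliningExperiment
{
    // Small, static, non-virtual: a typical candidate for JIT inlining.
    public static int Compare(double x, double y)
    {
        return x.CompareTo(y);
    }

    // Identical body, but the JIT is told never to inline this method.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int CompareNoInline(double x, double y)
    {
        return x.CompareTo(y);
    }
}
```

If marking a method NoInlining makes the fast version as slow as the derived one, inlining is probably what the base-class version is missing. If the timings don't move, the cause is elsewhere. Either way, measure a release build run outside the debugger, since the debugger suppresses JIT optimizations.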
Edit: Hah, right after I submitted this I stumbled on another optimization. If you add this method to the base class:
Now the running time drops to 38 ms here, just barely above the original version. This blows my mind, because nothing actually references this method! The private Insert<U> method is still identical to the very first code block in my answer. I was going to change the first argument to reference the CreateNilNode method, but I didn’t have to. Either the jitter is seeing that the anonymous delegate is the same as the CreateNilNode method and sharing the body (probably inlining again), or… or, I don’t know. This is the first instance I’ve ever witnessed where adding a private method and never calling it can speed up a program by a factor of 4.
You’ll have to check this to make sure I haven’t accidentally introduced any logic errors – pretty sure I haven’t, the code is almost the same – but if it all checks out, then here you are, this runs almost as fast as the non-derived AvlTree.
FURTHER UPDATE
I was able to come up with a version of the base/derived combination that actually runs slightly faster than the single-class version. Took some coaxing, but it works!
What we need to do is create a dedicated inserter that can create all of the delegates just once, without needing to do any variable capturing. Instead, all of the state is stored in member fields. Put this inside the BaseBinaryTree class:
Yes, yes, I know, it’s very un-functional using that mutable internal tree state, but remember that this isn’t the tree itself, it’s just a throwaway “runnable” instance. Nobody ever said that perf-opt was pretty! This is the only way to avoid creating a new Inserter for each recursive call, which would otherwise slow this down on account of all the new allocations of the Inserter and its internal delegates.
Now replace the insertion methods of the base class with this:
I’ve made the public Insert method non-virtual; all of the real work is delegated to a protected method that takes (or creates its own) Inserter instance. Altering the derived class is simple enough, just replace the overridden Insert method with this:
That’s it. Now run this. It will take almost the exact same amount of time as the AvlTree, usually a few milliseconds less in a release build.
The slowdown is clearly due to some specific combination of virtual methods, anonymous methods and variable capturing that’s somehow preventing the jitter from making an important optimization. I’m not so sure that it’s inlining anymore, it might just be caching the delegates, but I think the only people who could really elaborate are the jitter folks themselves.
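To make the Inserter idea concrete, here is a self-contained sketch of the pattern on a deliberately tiny tree. It is not the BaseBinaryTree or DerivedAvlTree code: the Tree, Match, NilNode, Node and Inserter members below are all hypothetical, the tree stores plain ints, and it does no AVL rebalancing. It also passes the current subtree as a parameter instead of keeping it in a mutable field. What it does show is the key point: one Inserter per top-level Insert, with its two delegates built once from instance methods, so the recursion allocates no closures and no further delegates.

```csharp
using System;

// Hypothetical minimal immutable tree, only to illustrate the Inserter pattern.
public abstract class Tree
{
    public static readonly Tree Nil = new NilNode();

    // Pattern-match style dispatch: one delegate per node shape.
    public abstract TResult Match<TResult>(
        Func<TResult> onNil,
        Func<Tree, int, Tree, TResult> onNode);

    // Non-virtual entry point: one Inserter (and two delegates) per insert.
    public Tree Insert(int value)
    {
        return new Inserter(value).Run(this);
    }

    public int Count
    {
        get { return Match<int>(() => 0, (l, v, r) => l.Count + 1 + r.Count); }
    }

    private sealed class NilNode : Tree
    {
        public override TResult Match<TResult>(
            Func<TResult> onNil, Func<Tree, int, Tree, TResult> onNode)
        {
            return onNil();
        }
    }

    private sealed class Node : Tree
    {
        private readonly Tree left;
        private readonly int value;
        private readonly Tree right;

        public Node(Tree left, int value, Tree right)
        {
            this.left = left;
            this.value = value;
            this.right = right;
        }

        public override TResult Match<TResult>(
            Func<TResult> onNil, Func<Tree, int, Tree, TResult> onNode)
        {
            return onNode(left, value, right);
        }
    }

    // Throwaway helper: state lives in fields, so nothing is captured.
    private sealed class Inserter
    {
        private readonly int value;
        private readonly Func<Tree> onNil;
        private readonly Func<Tree, int, Tree, Tree> onNode;

        public Inserter(int value)
        {
            this.value = value;
            // Delegates to instance methods, created exactly once here.
            onNil = CreateLeaf;
            onNode = InsertIntoNode;
        }

        public Tree Run(Tree tree)
        {
            return tree.Match(onNil, onNode);
        }

        private Tree CreateLeaf()
        {
            return new Node(Nil, value, Nil);
        }

        private Tree InsertIntoNode(Tree l, int v, Tree r)
        {
            int c = value.CompareTo(v);
            if (c < 0) return new Node(Run(l), v, r);   // recurse left, reusing this Inserter
            if (c > 0) return new Node(l, v, Run(r));   // recurse right, reusing this Inserter
            return new Node(l, v, r);                   // duplicate value: tree is unchanged
        }
    }
}
```

Usage is the same as with the trees in the question: start from `Tree.Nil` and reassign on every insert, for example `tree = tree.Insert(3);`. A real version would add the comparer and the balancing step inside InsertIntoNode, which are left out here to keep the sketch short.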