I cannot understand how GroupBy() appears to perform faster with a multi-pass ResultSelector than with a single-pass version.
Given this class:
    public class DummyItem
    {
        public string Category { get; set; }
        public decimal V1 { get; set; }
        public decimal V2 { get; set; }
    }
I create an array with 100,000 entries with some random data and then iterate the following query:
APPROACH 1: Multiple passes for category totals
    var q = randomData.GroupBy(
        x => x.Category,
        (k, l) => new DummyItem
        {
            Category = k,
            V1 = l.Sum(x => x.V1), // Iterate the items for this category
            V2 = l.Sum(x => x.V2), // Iterate them again
        }
    );
It appears to be double handling the inner enumerable where it sums V1 and V2 for each category.
So I put the following alternative together, presuming that this would provide better performance by calculating category totals in a single pass.
APPROACH 2: Single pass for category totals
    var q = randomData.GroupBy(
        x => x.Category,
        (k, l) => l.Aggregate( // Iterate the inner list once per category
            new decimal[2],
            (t, d) =>
            {
                t[0] += d.V1;
                t[1] += d.V2;
                return t;
            },
            t => new DummyItem { Category = k, V1 = t[0], V2 = t[1] }
        )
    );
Fairly typical results:
'Multiple pass': iterations=5 average=2,961 ms each
'Single pass': iterations=5 average=5,146 ms each
Incredibly, Approach 2 takes up to twice as long as Approach 1. I have run numerous benchmarks varying the number of V* properties, the number of distinct categories and other factors. While the magnitude of the performance difference varies, Approach 2 is always substantially slower than Approach 1.
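For anyone wanting to reproduce this, here is a minimal sketch of a harness that could produce timings in the format shown above. It is not the harness actually used. The names Bench, MakeData, Time, MultiPass and SinglePass, the fixed random seed, the count of 10 categories and the value range are all assumptions made for illustration. Since GroupBy() is deferred, each query is forced with ToList() so the work really runs inside the timed region.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public class DummyItem
{
    public string Category { get; set; }
    public decimal V1 { get; set; }
    public decimal V2 { get; set; }
}

public static class Bench
{
    // Hypothetical helper: builds `count` items spread over `categories` categories.
    public static DummyItem[] MakeData(int count, int categories, int seed = 42)
    {
        var rng = new Random(seed);
        return Enumerable.Range(0, count)
            .Select(i => new DummyItem
            {
                Category = "C" + rng.Next(categories),
                V1 = (decimal)rng.NextDouble() * 100m,
                V2 = (decimal)rng.NextDouble() * 100m,
            })
            .ToArray();
    }

    // Approach 1: two Sum() calls per group. ToList() forces the deferred query.
    public static List<DummyItem> MultiPass(IEnumerable<DummyItem> data)
    {
        return data.GroupBy(
            x => x.Category,
            (k, l) => new DummyItem
            {
                Category = k,
                V1 = l.Sum(x => x.V1),
                V2 = l.Sum(x => x.V2),
            }).ToList();
    }

    // Approach 2: one Aggregate() call per group.
    public static List<DummyItem> SinglePass(IEnumerable<DummyItem> data)
    {
        return data.GroupBy(
            x => x.Category,
            (k, l) => l.Aggregate(
                new decimal[2],
                (t, d) => { t[0] += d.V1; t[1] += d.V2; return t; },
                t => new DummyItem { Category = k, V1 = t[0], V2 = t[1] })).ToList();
    }

    // Hypothetical helper: runs `action` `iterations` times and returns the average in ms.
    public static double Time(string name, int iterations, Action action)
    {
        action(); // Warm-up run so JIT compilation is not included in the timing
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();
        double average = sw.Elapsed.TotalMilliseconds / iterations;
        Console.WriteLine($"'{name}': iterations={iterations} average={average:N0} ms each");
        return average;
    }

    public static void Main()
    {
        var randomData = MakeData(100000, 10);
        Time("Multiple pass", 5, () => MultiPass(randomData));
        Time("Single pass", 5, () => SinglePass(randomData));
    }
}
```

Absolute numbers from a sketch like this will depend on the machine, the runtime and the build configuration, so only the relative ordering is worth comparing.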
Am I missing something fundamental here? How can Approach 1 be faster than approach 2?
(I sense a facepalm coming…)
* UPDATE *
After @Jirka’s answer I thought it would be worth removing GroupBy() from the picture to see if simple aggregations on a large list performed as expected. The task was simply to compute the totals for the two decimal variables on the same list of 100,000 random rows.
The results continued the surprises:
SUM: ForEach
    decimal t1 = 0M;
    decimal t2 = 0M;
    foreach (var item in randomData)
    {
        t1 += item.V1;
        t2 += item.V2;
    }
The baseline. I believe the fastest way of getting the required output.
SUM: Multipass
    decimal t1 = randomData.Sum(x => x.V1); // First pass over the list
    decimal t2 = randomData.Sum(x => x.V2); // Second pass over the list
SUM: Singlepass
    var result = randomData.Aggregate(new DummyItem(), (t, x) =>
    {
        t.V1 += x.V1;
        t.V2 += x.V2;
        return t;
    });
The results were as follows:
'SUM: ForEach': iterations=10 average=1,793 ms each
'SUM: Multipass': iterations=10 average=2,030 ms each
'SUM: Singlepass': iterations=10 average=5,714 ms each
Surprisingly, this reveals that the issue has nothing to do with GroupBy: the behavior is consistent with data aggregation generally. My assumption that it is better to do data aggregation in a single pass is simply wrong (probably a hangover from my db roots).
(facepalm)
As @Jirka has pointed out, the inlining that apparently occurs for the multipass approach means it is only marginally slower than the baseline ‘ForEach’. My naive attempt to optimise to a single pass ran almost 3 times slower!
It appears that when dealing with in-memory lists, whatever you do with the items in the list is likely to be a far bigger factor in performance than the iteration overhead.
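As an illustration of carrying the ‘ForEach’ baseline over to the grouped problem, here is a sketch that computes the per-category totals with a plain loop and a dictionary. It was not benchmarked in this post, so no performance claim is made for it. The names GroupedTotals and TotalsByCategory are hypothetical, and the sketch assumes Category is never null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class DummyItem
{
    public string Category { get; set; }
    public decimal V1 { get; set; }
    public decimal V2 { get; set; }
}

public static class GroupedTotals
{
    // Hypothetical helper: one pass over the input, one running total per category.
    // Assumes Category is non-null (Dictionary keys cannot be null).
    public static List<DummyItem> TotalsByCategory(IEnumerable<DummyItem> items)
    {
        var totals = new Dictionary<string, DummyItem>();
        foreach (var item in items)
        {
            DummyItem total;
            if (!totals.TryGetValue(item.Category, out total))
            {
                total = new DummyItem { Category = item.Category };
                totals.Add(item.Category, total);
            }
            total.V1 += item.V1;
            total.V2 += item.V2;
        }
        return totals.Values.ToList();
    }

    public static void Main()
    {
        var data = new[]
        {
            new DummyItem { Category = "A", V1 = 1m, V2 = 10m },
            new DummyItem { Category = "B", V1 = 2m, V2 = 20m },
            new DummyItem { Category = "A", V1 = 3m, V2 = 30m },
        };
        foreach (var t in TotalsByCategory(data).OrderBy(t => t.Category))
        {
            Console.WriteLine($"{t.Category}: V1={t.V1}, V2={t.V2}");
        }
    }
}
```

The result should match what Approach 1 and Approach 2 produce, apart from the order of the categories, which a Dictionary does not guarantee.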
Aggregate has to create roughly 100,000 activation records in the process, one non-inlineable method call per element. That offsets the advantage of the single pass.
Think of Count, Sum, Average etc. as optimized special cases of what Aggregate can do in the general case.
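To make that relationship concrete, here is a small sketch that expresses Sum, Count and Average through Aggregate and checks them against the built-in operators. The class and method names (AggregateEquivalents, SumOf, CountOf, AverageOf) are made up for illustration. The sketch only shows that the results agree and says nothing about how the built-in operators are implemented internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AggregateEquivalents
{
    // Sum expressed as a general fold: start at 0 and add each value.
    public static decimal SumOf(IEnumerable<decimal> values)
    {
        return values.Aggregate(0m, (total, v) => total + v);
    }

    // Count expressed as a fold that ignores the value and adds 1.
    public static int CountOf(IEnumerable<decimal> values)
    {
        return values.Aggregate(0, (n, v) => n + 1);
    }

    // Average needs two running values, so the accumulator carries both.
    // Like Enumerable.Average, this is only meaningful for a non-empty sequence.
    public static decimal AverageOf(IEnumerable<decimal> values)
    {
        return values.Aggregate(
            new { Sum = 0m, N = 0 },
            (acc, v) => new { Sum = acc.Sum + v, N = acc.N + 1 },
            acc => acc.Sum / acc.N);
    }

    public static void Main()
    {
        decimal[] values = { 1.5m, 2.25m, 3m };
        Console.WriteLine(SumOf(values) == values.Sum());         // True
        Console.WriteLine(CountOf(values) == values.Count());     // True
        Console.WriteLine(AverageOf(values) == values.Average()); // True
    }
}
```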