MSDN says that ParallelEnumerable.GroupBy groups in parallel the elements of a sequence according to a specified key selector function.
So my question is: how lazy is it?
It’s clear that ParallelQuery<IGrouping<,>> is lazy. But what about IGrouping<> itself: is it lazy as well?
So, if I do the following:
    var entities = sites.AsParallel()
        .Select(x => GetDataItemsFromWebsiteLazy(x))
        .SelectMany(x => x)
        .GroupBy(dataItem => dataItem.Url.Host)
        .AsParallel()
        .SelectMany(x => TransformToEntity(x));
Will TransformToEntity be called for the first time only after all sites have returned their results?
Or as soon as the first GetDataItemsFromWebsiteLazy() call yields an element?
The point of all that is to fire requests to different hosts in parallel.
Data processing goes as follows. For every website in a set:
- Request website
- Parse response and extract another site url
- Request site by extracted url
- Parse response and create entity from obtained data
The GroupBy extension is, in fact, not lazy at all (or, more accurately, not deferred at all), as can be easily demonstrated with the following test program:
This program outputs the following:
Meaning that even though we never actually iterate the result of the GetEvenNumbersUsingGroupBy method, it still gets executed.
This is in contrast to a normal deferred enumerable using the yield statement, as in:
This prints the following:
In other words, each time you iterate the results, they are re-evaluated, which is a typical characteristic of deferred evaluation (as opposed to straight-up lazy loading which caches the result after the first evaluation).
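As a rough illustration of that re-evaluation, here is a minimal sketch. It is not the original test program from this answer; `DeferredDemo`, `GetEvenNumbers` and the `Runs` counter are made-up names used only to make the behavior visible.

```csharp
using System;
using System.Collections.Generic;

public static class DeferredDemo
{
    // Counts how many times the iterator body has started running.
    public static int Runs;

    // Hypothetical iterator method (an assumption for illustration only).
    public static IEnumerable<int> GetEvenNumbers()
    {
        Runs++;
        Console.WriteLine("iterator body started");
        for (int i = 0; i <= 10; i += 2)
            yield return i;
    }

    public static void Main()
    {
        // Calling the method only creates the iterator object; the body has not run.
        var evens = GetEvenNumbers();
        Console.WriteLine($"runs after call: {Runs}");      // prints "runs after call: 0"

        // Each enumeration runs the body again from the top; nothing is cached.
        foreach (var n in evens) { }
        foreach (var n in evens) { }
        Console.WriteLine($"runs after two loops: {Runs}"); // prints "runs after two loops: 2"
    }
}
```

The counter stays at zero until the first foreach and goes up once per enumeration, which is the deferred (as opposed to cached) behavior described above.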
Note that GroupBy behaves the same way whether you use AsParallel or not; it’s a characteristic of the GroupBy extension itself (which by definition needs to build a hash table or other kind of lookup in order to store the individual groups) and wholly independent of concurrency.
It’s easy to see why this is the case if you think about how you would implement a deferred grouping function; in order to iterate all of the elements of a single group, you would have to iterate the entire sequence to be sure that you’ve actually covered all of the elements of that group. So while it might technically be possible to defer this one-time iteration of the entire sequence, it’s probably not worth it in most cases, since it’s going to have the exact same memory and CPU characteristics as the eagerly-loaded version.
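The practical consequence for the question can be seen in a small sketch. It uses the sequential Enumerable.GroupBy so that the log order is deterministic; `GroupByBufferingDemo`, `Source` and `Log` are hypothetical names, and the integer source merely stands in for the web requests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupByBufferingDemo
{
    // Records the order in which things happen.
    public static readonly List<string> Log = new List<string>();

    // Hypothetical lazy source standing in for the per-site data items.
    public static IEnumerable<int> Source()
    {
        for (int i = 1; i <= 4; i++)
        {
            Log.Add($"source yields {i}");
            yield return i;
        }
    }

    public static void Main()
    {
        var groups = Source().GroupBy(n => n % 2);

        // Ask for the first group only, then stop.
        foreach (var g in groups)
        {
            Log.Add($"group {g.Key} received");
            break;
        }

        foreach (var line in Log)
            Console.WriteLine(line);
        // Prints:
        // source yields 1
        // source yields 2
        // source yields 3
        // source yields 4
        // group 1 received
    }
}
```

Even though only one group is requested, all four source elements are produced before that group is handed out. Applied to the query in the question, this suggests TransformToEntity cannot see its first group until every site has been fetched and parsed.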