I have a collection of strings which I need to perform two operations on.
The first of these can safely be processed independently in any order (yay), but then the output must be processed sequentially (boo) in the original order.
The following PLINQ gets me most of the way there:
myStrings.AsParallel().AsOrdered()
    .Select( str => Operation1(str) )
    .AsSequential()
    .Select( str => Operation2(str) );
// imagine Operation2() maintains some sort of state and must take the outputs from Operation1 in the original order
The problem is that, because of the AsOrdered(), Operation1 gets executed on every string first, then the result elements are sorted back into their original order, and only then does Operation2 start executing.
Ideally, as soon as the first string (i.e. myStrings[0], not the first one returned) is returned by an Operation1 call, I’d like Operation2 to begin its work.
So this is my attempt to solve the problem generically:
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelHelper
{
    public static IEnumerable<U> SelectAsOrdered<T, U>(this ParallelQuery<T> query, Func<T, U> func)
    {
        // Results that finished ahead of their turn, keyed by source index.
        var completedTasks = new Dictionary<int, U>();
        var queryWithIndexes = query.Select((x, y) => new { Input = x, Index = y })
            .AsParallel()
            .Select(t => new { Value = func(t.Input), Index = t.Index })
            .WithMergeOptions(ParallelMergeOptions.NotBuffered);
        int i = 0; // index of the next result to yield
        foreach (var task in queryWithIndexes)
        {
            if (i == task.Index)
            {
                Console.WriteLine("immediately yielding task: {0}", i);
                i++;
                yield return task.Value;
                // Flush any buffered results that are now next in line.
                U previouslyCompletedTask;
                while (completedTasks.TryGetValue(i, out previouslyCompletedTask))
                {
                    completedTasks.Remove(i);
                    Console.WriteLine("delayed yielding task: {0}", i);
                    yield return previouslyCompletedTask;
                    i++;
                }
            }
            else
            {
                // Finished early: hold it until its predecessors have been yielded.
                completedTasks.Add(task.Index, task.Value);
            }
        }
        yield break;
    }
}
Then I can re-write my original code block as:
myStrings.AsParallel()
    .SelectAsOrdered( str => Operation1(str) )
    .Select(str => Operation2(str));
and Operation2 kicks off as soon as myStrings[0] comes out from Operation1.
What I’d like to know is:
- This is a fairly common problem/pattern within parallelisation. Have I missed something out of the box in the .NET Framework that does this? Or is there a simpler way?
- While the above extension method seems to do the job, how could it be improved? Does anything in the code look like it’s a bad idea?
Thanks!
Andy
Just in case you’re interested:
- Without the call to .WithMergeOptions(ParallelMergeOptions.NotBuffered), Operation2 doesn’t begin its work until all Operation1 calls have been started (which is better than the original code, which waited until they were all completed).
- The real life problem: Operation1 is searching for legal citations and references within large bodies of text (e.g. “children act 1989”). These references are usually independent, but occasionally a transcript will contain something like “section 6 of the previously mentioned act”. Operation2 relies on captures from Operation1 to pick up these partial references.
If you need speed, you can parallelize the whole process (load data, prepare data, process data and aggregate data); for that, I think a producer/consumer pattern is the better choice.
If you want to stay with LINQ, though, you can’t generate the data in parallel (at least not in a way that gives a completely parallel workflow), although you can prepare, process and aggregate it in parallel.
On the other hand, I think it is a mistake (although it is possible) to try to use LINQ as “parallel(A) + sequential(B)”. Your process, I think, is
so B must wait for A.
Why not simply do “parallel(A/B)”?
You can write a helper (extension method), but I don’t think it is useful in general.
In your real case, simply use a Semaphore to prevent premature access to an “Article ID”.
Complete code to prepare, process and aggregate in parallel (without generating) is:
As you can see, ProcessText processes all articles in parallel. Only PREVID articles wait until their previous article has finished computing its own ID.
The real problem in abstracting this behavior, I think, is the relationships between items (one item depends on another). In LINQ, the natural model is “no relationships between items” (you must use “group by” to express them).
I suggest you use a producer/consumer pattern.
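For illustration, here is a minimal sketch of what an order-preserving producer/consumer could look like. It is not the listing referred to above, only one possible shape under some assumptions: the names OrderedPipeline, SelectInOrder and maxInFlight are hypothetical, and it assumes .NET 4.5 or later (for Task.Run). The producer starts one task per input, in source order, and queues the tasks; the consumer takes the tasks off the queue in that same order and waits on each one, so the sequential step can begin as soon as the first input has been processed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class OrderedPipeline
{
    // Runs operation1 on pool threads but yields results in source order.
    // maxInFlight (an assumed tuning knob) bounds how many tasks may be queued ahead of the consumer.
    public static IEnumerable<TOut> SelectInOrder<TIn, TOut>(
        IEnumerable<TIn> source, Func<TIn, TOut> operation1, int maxInFlight)
    {
        var cts = new CancellationTokenSource();
        var queue = new BlockingCollection<Task<TOut>>(maxInFlight);

        // Producer: start one task per item, in order, and enqueue the task itself.
        var producer = Task.Run(() =>
        {
            try
            {
                foreach (var item in source)
                {
                    // Add blocks while the queue is full; the token lets us stop early.
                    queue.Add(Task.Run(() => operation1(item)), cts.Token);
                }
            }
            catch (OperationCanceledException)
            {
                // The consumer stopped enumerating; nothing more to produce.
            }
            finally
            {
                queue.CompleteAdding();
            }
        });

        try
        {
            // Consumer: tasks come out in the order they were queued,
            // so waiting on each in turn preserves the original order.
            foreach (var task in queue.GetConsumingEnumerable())
            {
                yield return task.Result;
            }
        }
        finally
        {
            cts.Cancel();      // unblock the producer if we are leaving early
            producer.Wait();   // surfaces any exception thrown while reading the source
            queue.Dispose();
            cts.Dispose();
        }
    }
}

// Possible usage with the names from the question:
//   foreach (var r in OrderedPipeline.SelectInOrder(myStrings, Operation1, 8))
//       Operation2(r);
```

The bounded queue is a design choice rather than a requirement: it keeps the parallel stage from racing arbitrarily far ahead of the sequential one, at the cost of the producer blocking while the queue is full.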