I have a collection of strings which I need to perform two operations on.


Asked: June 13, 2026


I have a collection of strings which I need to perform two operations on.

The first of these can safely be processed independently in any order (yay), but then the output must be processed sequentially (boo) in the original order.

The following PLINQ gets me most of the way there:

myStrings.AsParallel().AsOrdered()
         .Select( str => Operation1(str) )
         .AsSequential()
         .Select( str => Operation2(str) );
// imagine Operation2() maintains some sort of state and must take the outputs from Operation1 in the original order

The problem is that, because of the AsOrdered(), Operation1 gets executed on every string first, then the results are sorted back into their original order, and only then does Operation2 start executing.

Ideally, as soon as the first string (i.e. myStrings[0], not the first one returned) is returned by an Operation1 call, I’d like Operation2 to begin its work.

So this is my attempt to solve the problem generically:

public static class ParallelHelper
{
    public static IEnumerable<U> SelectAsOrdered<T, U>(this ParallelQuery<T> query, Func<T, U> func)
    {
        var completedTasks = new Dictionary<int, U>();
        var queryWithIndexes = query.Select((x, y) => new { Input = x, Index = y })
                                    .AsParallel()
                                    .Select(t => new { Value = func(t.Input), Index = t.Index })
                                    .WithMergeOptions(ParallelMergeOptions.NotBuffered);

        int i = 0;
        foreach (var task in queryWithIndexes)
        {
            if (i==task.Index)
            {
                Console.WriteLine("immediately yielding task: {0}", i);
                i++;
                yield return task.Value;

                U previouslyCompletedTask;
                while (completedTasks.TryGetValue(i, out previouslyCompletedTask))
                {
                    completedTasks.Remove(i);
                    Console.WriteLine("delayed yielding task: {0}", i);
                    yield return previouslyCompletedTask;
                    i++;
                }
            }
            else
            {
                completedTasks.Add(task.Index, task.Value);
            }
        }
        yield break;
    }
}

Then I can re-write my original code block as:

myStrings.AsParallel()
         .SelectAsOrdered( str => Operation1(str) )
         .Select(str => Operation2(str));

and Operation2 kicks off as soon as myStrings[0] comes out from Operation1.

What I’d like to know is:

This is a fairly common problem/pattern in parallelisation. Have I missed something out of the box in the .NET Framework that does this? Or is there a simpler way?
While the above extension method seems to do the job, how could it be improved? Does anything in the code look like it’s a bad idea?

Thanks!

Andy

Just in case you’re interested:

Without the call to .WithMergeOptions(ParallelMergeOptions.NotBuffered), Operation2 doesn’t begin its work until all Operation1 calls have been started (which is still better than the original code, which waited until they had all completed).
The real life problem:
Operation1 searches for legal citations and references within large bodies of text (e.g. “children act 1989”).
These references are usually independent, but occasionally a transcript will contain something like “section 6 of the previously mentioned act”.
Operation2 relies on captures from Operation1 to pick up these partial references.
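
To make the ordering requirement concrete, here is a minimal sketch of what a stateful Operation2 might look like. CitationResolver, Resolve, and the placeholder phrase are hypothetical names invented for illustration, not my real code; the point is only that the result of one call depends on what earlier calls saw.

```csharp
using System;

// Hypothetical stand-in for Operation2: it remembers the last full act name it
// saw, so it only gives correct answers when fed captures in document order.
public sealed class CitationResolver
{
    private const string Placeholder = "the previously mentioned act";
    private string lastAct; // state carried from one call to the next

    public string Resolve(string capture)
    {
        int at = capture.IndexOf(Placeholder, StringComparison.OrdinalIgnoreCase);
        if (at < 0)
        {
            lastAct = capture;   // a full reference: remember it
            return capture;
        }
        if (lastAct == null)
        {
            return capture;      // nothing to resolve against yet
        }
        // A partial reference: substitute the remembered act name.
        return capture.Substring(0, at) + lastAct
             + capture.Substring(at + Placeholder.Length);
    }
}
```

If the two captures from the example above reached Resolve in the wrong order, the partial reference would come back unresolved, which is why the second stage has to stay sequential.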


1 Answer

Editorial Team · Answer 1 · 2026-06-13T09:03:43+00:00

If you need speed, you can parallelize the whole process (loading, preparing, processing, and aggregating the data). For that, I think a producer/consumer pattern is the better fit.

If you want to stay with LINQ, however, you cannot cleanly parallelize the data generation step as part of a fully parallel workflow, although you can parallelize the prepare, process, and summarize steps.

On the other hand, I think it is a mistake (although it is possible) to use LINQ as “parallel(A) + sequential(B)”. Your process, as I understand it, is

B = f(A)

so B must wait for A.

Why not simply do “parallel(A/B)”?

You could write a helper (extension method), but I don’t think it would be useful in general.

In your real case, simply use a Semaphore to prevent premature access to an “Article ID”.

Complete code that prepares, processes, and summarizes in parallel (but does not generate the data in parallel) is:

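using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;

class Text {
    // Matches either " PREVID " or " ACTID=<digits> "; group 2 captures the digits.
    public static Regex rx = new Regex(@" (PREVID|ACTID\=([0-9]+)) ");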

    private Text prv; // previous article
    private string ot; // original text
    private int id; // act id on text
    private Semaphore isComputed = new Semaphore(0, 1);

    public int ActID {
        get {
            isComputed.WaitOne();
            int _id = id;
            isComputed.Release();
            return _id;
        }
    }

    public bool ProcessText() {
        var mx = rx.Match(ot);
        var prev = mx.Groups[1].Value == "PREVID";
        if (prev) {
            // Blocks until the previous article has published its id.
            id = prv == null ? 0 : prv.ActID;
        } else if (!int.TryParse(mx.Groups[2].Value, out id)) {
            throw new Exception(string.Format(@"Incorrect article id ""{0}""", mx.Groups[0].Value));
        }
        isComputed.Release(); // publish this article's id to any waiter
        return !prev;
    }

    public Text(string original_text, Text previous) {
        prv = previous;
        ot = original_text;
    }

}

class Program {
    public static void Main(string[] args) {

        // fixed seed so every run sees the same random stream (for debugging)
        var rnd = new Random(1);

        var noise = @"These references are usually independent, but occasionally";

        // some noise text: a random-length prefix of the sentence above
        var bit = new Func<string>(() =>
            noise.Substring(0, rnd.Next(noise.Length)));

        // random article: noise, then either " PREVID " or " ACTID=n ", then noise
        var text = new Func<string>(() =>
            string.Format(@"{0}{1}{2}", bit(),
                rnd.Next() % 2 == 0 ? " PREVID "
                                    : string.Format(@" ACTID={0} ", rnd.Next()), bit()));

        // random data input; each article keeps a link to the one before it
        var data = new List<Text>();
        Text prv = null;
        for (var n = 0; n < 1000000; n++)
            // a producer/consumer would be better for parallelizing this load step
            data.Add(prv = new Text(text(), prv));

        Console.Write("Press key to start...");
        Console.ReadKey();

        // parallel processing: counts the articles that carry their own ACTID
        Console.WriteLine("{0} unique ID's", data.AsParallel().Where(n => n.ProcessText()).Count());

        Console.WriteLine("Process completed.");
    }
}

As you can see, ProcessText processes all articles in parallel. Only PREVID articles wait until the previous article has computed its own ID.

The real difficulty in abstracting this behavior (I think) is the relationships between items (one item depends on another). In LINQ, the natural model is items with no relationships between them (you have to use “group by” to express one).

I suggest you use a producer/consumer pattern.
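
For reference, one way the producer/consumer idea could look is sketched below. It is only an illustration under assumptions: OrderedPipeline, Run, and maxInFlight are names made up here, and error handling is kept to a minimum. A single producer starts one task per input and puts the tasks on a bounded queue in input order; the consumer takes them off in that same order and waits for each result, so the second stage starts as soon as the first input is done.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class OrderedPipeline
{
    // Hypothetical helper: runs 'first' on each input in parallel, then feeds
    // the results to 'second' one at a time, in the original input order.
    public static List<TOut> Run<TIn, TMid, TOut>(
        IEnumerable<TIn> inputs,
        Func<TIn, TMid> first,
        Func<TMid, TOut> second,
        int maxInFlight = 16)
    {
        // The bounded queue holds tasks in input order and caps how many
        // 'first' calls can be started ahead of the consumer.
        using (var queue = new BlockingCollection<Task<TMid>>(maxInFlight))
        {
            // Producer: start one task per input and enqueue it immediately.
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var input in inputs)
                    {
                        var captured = input;
                        queue.Add(Task.Run(() => first(captured)));
                    }
                }
                finally
                {
                    queue.CompleteAdding();
                }
            });

            // Consumer: dequeue in input order and wait for each task in turn.
            var results = new List<TOut>();
            foreach (var task in queue.GetConsumingEnumerable())
            {
                results.Add(second(task.Result));
            }

            producer.Wait(); // surfaces any exception thrown while producing
            return results;
        }
    }
}
```

With the names from the question, a call might look like OrderedPipeline.Run(myStrings, str => Operation1(str), mid => Operation2(mid)). The bounded queue is a design choice: it stops the first stage from running arbitrarily far ahead of the sequential stage and holding all its results in memory.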
