I have a problem in our BTS production environment which we cannot reproduce in other environments

Asked: June 5, 2026

I have a problem in our BTS production environment which we cannot reproduce in other environments. Bear with me here.

Part of our solution is an orchestration (orch1) that sends a direct-bound message to the message box and then enters a Listen shape, with the correlated Receive shape on one branch and a Delay shape (implementing the receive timeout) on the other branch. The delay is set to 10 minutes.

The direct-bound request is processed by a different orchestration (orch2), which then returns the response (again via direct binding) to the message box so that orch1 can pick it up.

What is happening is that, about once in every 50 operations of this type, the timeout in orch1 fires, and when the response from orch2 eventually comes back we get a routing failure. That part is what you would expect, because orch1's instance subscription for the response has already been deleted.
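To make that failure mode concrete, here is a toy Python model of the listen-with-timeout pattern. It is not BizTalk code: `MessageBox`, `listen`, and the correlation IDs are hypothetical stand-ins, and a Python queue plays the role of orch1's instance subscription. Whichever branch completes first wins, the subscription is then removed, and a reply published afterwards has no subscriber to route to.

```python
import queue
import threading


class MessageBox:
    """Hypothetical stand-in for the message box: routes a message to
    whichever subscriber is registered for its correlation ID."""

    def __init__(self):
        self._subscriptions = {}
        self._lock = threading.Lock()

    def subscribe(self, correlation_id):
        inbox = queue.Queue()
        with self._lock:
            self._subscriptions[correlation_id] = inbox
        return inbox

    def unsubscribe(self, correlation_id):
        with self._lock:
            self._subscriptions.pop(correlation_id, None)

    def publish(self, correlation_id, message):
        with self._lock:
            inbox = self._subscriptions.get(correlation_id)
        if inbox is None:
            # No instance subscription left: the equivalent of a routing failure.
            return "routing failure"
        inbox.put(message)
        return "routed"


def listen(box, correlation_id, timeout_s):
    """Toy Listen shape: correlated receive on one branch, delay on the other.
    The first branch to complete wins; the subscription is removed either way."""
    inbox = box.subscribe(correlation_id)
    try:
        return inbox.get(timeout=timeout_s)   # receive branch
    except queue.Empty:
        return "timeout"                      # delay branch fired first
    finally:
        box.unsubscribe(correlation_id)


box = MessageBox()
print(listen(box, "req-1", 0.1))              # timeout
print(box.publish("req-1", "late response"))  # routing failure
```

The model only reproduces the symptom (timeout, then routing failure on the late reply); it says nothing about why the reply was late.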

The weird thing is that orch2 does not even initialise until AFTER the timeout has been hit in orch1 (see the following screenshots).

Orch1 timings

Here you can see that orch1 sends the direct-bound request to the message box and the timeout is hit 10 minutes later: the request is sent at 11:26:31 and the timeout fires at 11:36:32.

Orch2 timings

This shows the timings of orch2. As you can see, its receive shape is only hit at 11:36:45, 13 seconds after the timeout fired in orch1.

What is strange is that orch1 and orch2 run in the same host. Moreover, we have a load-balanced cluster with 2 instances of this host available to do work, so I would expect there always to be capacity for orch2 to process incoming work. However, this appears not to be the case.

My current suspicion is thread starvation across both host instances. My questions are:

Is this a sensible suspicion?
Am I doing something fundamentally wrong?
Is there anything about using the listen shape which affects threading?

Just to note, we have already configured the host thread settings to the recommended levels (MaxIOThreads = 100, MaxWorkerThreads = 100, MinIOThreads = 25, MinWorkerThreads = 25).
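The thread-starvation suspicion can be illustrated with a second toy model. This Python sketch is hypothetical and does not model BizTalk's actual scheduling, dehydration, or throttling; `simulate` and its parameters are made up for illustration. It only shows the shape of the symptom: when every worker in a fixed-size pool is already busy, orch2's work item waits in the queue, orch1's timeout fires first, and orch2 starts only afterwards, which matches the ordering in the timings above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def simulate(workers, busy_jobs, busy_s, timeout_s):
    """Hypothetical toy model: `busy_jobs` long-running tasks occupy a pool of
    `workers` threads before orch2's request is queued; orch1 waits at most
    `timeout_s` seconds for orch2's reply.

    Returns (outcome, orch2_start_s), where orch2_start_s is how many seconds
    after the request orch2 actually started running.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Other work that already holds the pool's threads.
        for _ in range(busy_jobs):
            pool.submit(time.sleep, busy_s)

        def orch2():
            # Record the moment a worker thread finally picks up orch2.
            return time.monotonic() - start

        reply = pool.submit(orch2)
        try:
            # orch1's listen: take the reply, or give up at the timeout.
            return "response", reply.result(timeout=timeout_s)
        except FutureTimeout:
            # orch1 has timed out; orch2 still runs once a thread frees up.
            return "timeout", reply.result()


print(simulate(workers=2, busy_jobs=1, busy_s=0.5, timeout_s=0.2))
print(simulate(workers=2, busy_jobs=2, busy_s=0.5, timeout_s=0.2))
```

With one free thread, orch2 starts almost immediately and the reply arrives in time; with both threads busy, the call returns `"timeout"` and orch2 starts roughly 0.5 s in, after orch1 has given up. In this model, more workers only help if busy tasks really are what hold the threads.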

1 Answer

Editorial Team · Answer 1 · June 5, 2026

Sounds like a race condition, but I have no idea where.

Have you considered separating out the tasks?

First part of orch1 sends the request (task 1).
Orch2 processes the output of task 1 (task 2).
Second part of orch1 processes the responses from orch2 (task 2).

The drawback is that this has no way to respond to timeouts.
I don't know whether that matters for your problem.
