I have a problem in our BTS production environment which we cannot reproduce in other environments. Bear with me here.
As part of our solution, an orchestration (orch1) sends a direct bound message to the message box and then steps into a listen shape, with the correlated receive shape on one branch and a delay shape (implementing the receive timeout) on the other branch. The delay is set to 10 minutes.
The direct bound request is processed by a different orchestration (orch2), which then returns the response (again via direct bind) to the message box so that orch1 can pick it up.
What is happening is that about once in every 50 operations of this type, the timeout in orch1 is hit. When the response from orch2 then comes back, we get a routing failure, which is what you would expect because the instance subscription on orch1 for the message has already been deleted.
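To make the failure mode concrete, here is a minimal Python model of the pattern. It is not BizTalk code: the queue stands in for the message box, and the names `listen`, `simulate`, `pickup_delay_s` and `timeout_s` are made up for illustration. It only shows that if the responder starts later than the timeout, the response arrives with nobody left to receive it.

```python
import asyncio


async def listen(responses: asyncio.Queue, timeout_s: float):
    """Stand-in for the listen shape: correlated receive vs. delay branch."""
    try:
        msg = await asyncio.wait_for(responses.get(), timeout=timeout_s)
        return "received", msg
    except asyncio.TimeoutError:
        # Delay branch fired; the pending receive (the "subscription") is cancelled.
        return "timeout", None


async def simulate(pickup_delay_s: float, timeout_s: float):
    """Return (outcome for orch1, number of responses left with no receiver)."""
    responses = asyncio.Queue()

    async def orch2():
        # Hypothetical delay before orch2 starts processing the request.
        await asyncio.sleep(pickup_delay_s)
        await responses.put("response")

    responder = asyncio.create_task(orch2())
    outcome, _ = await listen(responses, timeout_s)
    await responder  # let the late response be published anyway
    return outcome, responses.qsize()


if __name__ == "__main__":
    # Normal case: orch2 starts well inside the timeout window.
    print(asyncio.run(simulate(pickup_delay_s=0.01, timeout_s=0.2)))
    # Failing case: orch2 starts after the timeout has already fired.
    print(asyncio.run(simulate(pickup_delay_s=0.05, timeout_s=0.01)))
```

In the second call the leftover item in the queue plays the role of the routing failure: the response is published, but the receiver that would have matched it is gone.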
The weird thing is that orch2 does not even initialise until AFTER the timeout has been hit in orch1 (see the following screenshots).

Here you can see orch1 sends the direct bound request to the message box and 10 minutes later the timeout is being hit. The request is sent at 11:26:31 and the timeout is hit at 11:36:32.

This shows the timings of orch2. As you can see, its receive shape is only hit at 11:36:45, after the timeout has already fired in orch1.
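For reference, the gaps implied by those timestamps can be worked out with a short Python sketch. It assumes all three times are on the same day and that 11:36:45 is the moment the orch2 receive shape fired; the variable names are just for illustration.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps quoted from the orchestration traces above.
sent = datetime.strptime("11:26:31", FMT)           # orch1 publishes the request
timeout_hit = datetime.strptime("11:36:32", FMT)    # orch1 delay branch fires
orch2_receive = datetime.strptime("11:36:45", FMT)  # orch2 receive shape fires

timeout_after = timeout_hit - sent      # how long orch1 waited
pickup_after = orch2_receive - sent     # how long the request sat unprocessed
late_by = orch2_receive - timeout_hit   # how far orch2 missed the window

print("timeout after:", timeout_after)
print("orch2 pickup after:", pickup_after)
print("orch2 late by:", late_by)
```

So the request waited a little over 10 minutes before orch2 activated at all, and orch2 missed the timeout window by seconds rather than by a wide margin.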
What is strange is that both orch1 and orch2 are hosted in the same host. Moreover, we have a load balanced cluster and we have 2 instances of this host available to do work. So I would expect that there should always be availability on orch2 to process incoming work. However this appears not to be the case.
My current suspicion is thread starvation across both host instances. However, my questions are:
- Is this a sensible suspicion?
- Am I doing something fundamentally wrong?
- Is there anything about using the listen shape which affects threading?
Just to note, we have already configured the host thread settings to the recommended levels (MaxIOThreads = 100, MaxWorkerThreads = 100, MinIOThreads = 25, MinWorkerThreads = 25).
Sounds like a race condition but I have no idea where.
Have you considered separating out the tasks?
The drawback is that this approach has no ability to respond to timeouts.
I don’t know if that’s important to your problem or not.