We had a terrible problem/experience yesterday when trying to swap our staging <–> production

Question

Editorial Team

Asked: May 31, 2026 (2026-05-31T19:29:56+00:00)


We had a terrible experience yesterday when trying to swap our staging and production roles.

Here is our setup:

We have a worker role picking up messages from the queue. These messages are processed on the role (Table Storage inserts, database selects, etc.). This can take maybe 1-3 seconds per queue message, depending on how many Table Storage inserts it needs to make. It deletes the message when everything is finished.

Problem when swapping:

When our staging project went online, our production worker role started erroring.

When the role tried to process a queue message, it produced a constant stream of ‘EntityAlreadyExists’ errors. Because of these errors, queue messages weren’t getting deleted. This caused the messages to be put back in the queue, picked up for processing again, and so on.

When we looked inside these queue messages and analysed what had happened to them, we saw they had actually been processed but not deleted.

The problem wasn’t over after we deleted these faulty messages. Newly queued messages weren’t processed either: they stayed unprocessed and no Table Storage records were added, which sounds very strange.

After deleting both staging and production and publishing to production again, everything started to work just fine.

Possible problem(s)?

We have little to no idea what actually happened.

Maybe both the roles picked up the same messages and one did the post and one errored?
…???

Possible solution(s)?

We have some ideas on how to solve this ‘problem’.

Make a poison message fail over system? When the dequeue count gets over X we should just delete that queue message or place it into a separate ‘poisonqueue’.
Catch the EntityAlreadyExists error and just delete that queue message or put it in a separate queue.
…????

Multiple roles

I suppose we will have the same problem when putting up multiple roles?

Many thanks.

EDIT 24/02/2012 – Extra information

We actually use GetMessage().
Every item in the queue is unique and will generate unique messages in Table Storage. A little more information about the process: a user posts something that has to be distributed to certain other users. The message generated from that user has a unique ID (a GUID). This message is posted into the queue and picked up by the worker role. The message is distributed over several other tables (PartitionKey -> UserId, RowKey -> a timestamp in ticks & the unique message ID). So there is almost no chance the same message will be posted twice in a normal situation.
The invisibility timeout COULD be a logical explanation, because some messages could be distributed to around 10-20 tables. This means 10-20 inserts without the batch option. Can you set or extend this invisibility timeout?
Not deleting the queue message because of an exception COULD be an explanation as well, because we didn’t implement any poison message failover YET ;).


1 Answer

Editorial Team · Answer 1 · 2026-05-31T19:29:57+00:00

You clearly have a fault in how you handle duplicate messages. The fact that your ID is unique doesn’t mean that the message will not be processed twice on some occasions, such as:

The role dying with partially finished work, so the message reappears in the queue for processing
The role crashing unexpectedly, so the message ends up back in the queue
The Fabric Controller (FC) moving your role while you don’t have code to handle this situation, so the message ends up back in the queue

In all cases, you need code that handles the fact that the message will reappear. One way is to use the DequeueCount property to check how many times the message has been removed from the queue and received for processing. Make sure you have code that handles partial processing of a message.
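
To make the idea concrete, here is a minimal sketch in Python rather than the worker role's own code. It does not use the Azure SDK: FakeTable, Message, process and MAX_DEQUEUE_COUNT are all hypothetical names, the threshold of 5 is an arbitrary assumption, and using the message ID alone as the row key is a simplification of the key scheme described in the question. It shows the two suggestions together: park a message in a poison queue once its dequeue count is too high, and treat an "already exists" conflict as a sign that an earlier attempt wrote that row, so the remaining inserts can still complete.

```python
from dataclasses import dataclass

MAX_DEQUEUE_COUNT = 5  # assumed threshold; tune for your workload


class EntityAlreadyExists(Exception):
    """Stand-in for the table storage conflict error (hypothetical)."""


@dataclass
class Message:
    id: str
    recipients: list
    dequeue_count: int = 0


class FakeTable:
    """In-memory stand-in for a table; rows are keyed by (partition, row)."""

    def __init__(self):
        self.rows = {}

    def insert(self, partition_key, row_key, value):
        key = (partition_key, row_key)
        if key in self.rows:
            raise EntityAlreadyExists(key)
        self.rows[key] = value


def process(message, table, poison_queue):
    """Return True when the caller should delete the message from the queue."""
    if message.dequeue_count > MAX_DEQUEUE_COUNT:
        # Seen too many times: park it for inspection instead of retrying.
        poison_queue.append(message)
        return True
    for user_id in message.recipients:
        try:
            table.insert(user_id, message.id, "payload")
        except EntityAlreadyExists:
            # An earlier, interrupted attempt already wrote this row.
            continue
    return True
```

The key point is that the row key must be derived from the message itself (not from the time of processing), so that a retry produces the same keys and the conflict can be recognised as a harmless repeat.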

What probably happened during the swap is this: when the production environment became staging and staging became production, both of them were trying to receive the same messages, so they were basically competing with each other for those messages. That is probably not bad in itself, because competing consumers is a known pattern that works. But when you killed your old production (now staging), every message that had been received for processing and wasn’t finished ended up back in the queue, and your new production environment picked it up for processing again. With no code to handle this scenario and a message that was partially processed, some records already existed in the tables, and that started causing the behavior you noticed.
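
The sequence described above can be replayed with a small in-memory simulation. This is only an illustration under stated assumptions: FakeQueue and simulate_swap are hypothetical names, time is a plain integer, the 30-unit visibility timeout is made up, and nothing here calls the real queue service.

```python
class FakeQueue:
    """In-memory stand-in for a queue with a visibility timeout (not a real SDK)."""

    def __init__(self):
        self._messages = []

    def put(self, body):
        self._messages.append({"body": body, "visible_at": 0, "dequeue_count": 0})

    def get(self, now, visibility_timeout):
        """Return the first visible message and hide it until now + timeout."""
        for message in self._messages:
            if message["visible_at"] <= now:
                message["visible_at"] = now + visibility_timeout
                message["dequeue_count"] += 1
                return message
        return None

    def delete(self, message):
        self._messages.remove(message)


def simulate_swap():
    queue = FakeQueue()
    table = set()  # rows already written, as (user_id, message_id)
    events = []
    queue.put({"id": "msg-1", "recipients": ["user-a", "user-b"]})

    # Old production picks up the message at t=0 and writes one row...
    first = queue.get(now=0, visibility_timeout=30)
    table.add(("user-a", first["body"]["id"]))
    # ...then is torn down before finishing, so it never deletes the message.

    # New production polls at t=10: the message is still invisible.
    events.append(("t=10", queue.get(now=10, visibility_timeout=30)))

    # At t=30 the timeout has expired and the message is delivered again.
    second = queue.get(now=30, visibility_timeout=30)
    events.append(("t=30", second["dequeue_count"]))

    # A plain insert of the first row would now collide with the earlier write.
    collides = ("user-a", second["body"]["id"]) in table
    return events, collides
```

Calling simulate_swap() shows the message being hidden while the first receiver holds it, then coming back with a dequeue count of 2, at which point the first insert would hit a row that already exists.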
