My Azure role grabs stuff to process from a database – it holds an instance of System.Data.SqlClient.SqlConnection and periodically creates an SqlCommand instance and executes an SQL query.
Once in a while (usually once every several days), running a query throws an SqlException:
The service has encountered an error processing your request. Please try again. Error code 40143.
A severe error occurred on the current command. The results, if any, should be discarded.
I’ve seen this many times already, so my code now catches it, calls Dispose() on the SqlConnection instance, reopens the connection and retries the query. The retry typically results in another SqlException:
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
This looks very much like the SQL Azure server not responding or being unavailable for some reason.
Currently my code doesn’t catch this second exception, so it propagates out of RoleEntryPoint.Run() and the role is restarted. The restart typically takes about ten minutes, and once it completes the problem is gone for a day or so.
I don’t like my role restarting: it takes a while and my service functionality is hindered in the meantime. I’d like to do something smarter.
What would be a strategy to address this problem? Should I retry the query several times and how many times and with what interval? Should I do something else? When do I give up and let the role just restart?
I would strongly recommend you have a look at the Transient Fault Handling Framework for SQL Azure.
It will help you handle retry logic for both connection and query attempts. I am using it in production and it works great. There is also a nice article on TechNet that might be of some use.
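To make the idea of retry logic concrete, here is a minimal hand-rolled sketch of retrying with an exponentially growing delay. It is not the framework's API: `RetrySketch`, `ExecuteWithRetry`, and all of its parameters are hypothetical names made up for illustration, and the choice of which exceptions count as transient is left to the caller.

```csharp
using System;
using System.Threading;

public static class RetrySketch
{
    // Hypothetical helper (not part of any library): runs `operation` and, if it
    // throws an exception that `isTransient` accepts, waits and tries again.
    // The delay doubles after every failed attempt. Once `maxAttempts` is reached,
    // or the exception is not transient, the exception is rethrown to the caller.
    public static T ExecuteWithRetry<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan initialDelay,
        Action<TimeSpan> sleep = null)
    {
        if (operation == null) throw new ArgumentNullException("operation");
        if (isTransient == null) throw new ArgumentNullException("isTransient");
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        // The sleep function is injectable so the helper can be exercised without
        // real waiting; by default it blocks the current thread.
        if (sleep == null) sleep = d => Thread.Sleep(d);

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex)
            {
                // Give up: out of attempts, or not an error worth retrying.
                if (attempt >= maxAttempts || !isTransient(ex)) throw;

                sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

In the role, `operation` would be the code that opens a fresh SqlConnection and runs the query, and `isTransient` would check for an SqlException such as the ones quoted in the question. If the final attempt still fails, the exception escapes as before and the role restarts, so the attempt count and initial delay decide how long you are willing to wait before falling back to a restart.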
[EDIT: 17 Oct 2013]
It looks like this has since been picked up by the patterns & practices team as the Transient Fault Handling Application Block.