I have clients with really bad networks, including bad mappings at the gateways and issues with aliasing. Sometimes they go days without a hitch, other days our services fail because they can’t connect to the database or the connections get mysteriously dropped.
How far should a program (namely a service) go to recover or retry? Is it reasonable to ask their network folks to get it working properly, or should I take it upon myself to survive its flakiness?
1) Yes, it’s reasonable to expect their network to work … you wouldn’t tell someone that the car they bought is broken because they don’t have any roads to drive it on, would you?
2) That said: program defensively. When you build a car, you can’t expect everything to be a perfectly smooth interstate highway.
More specifically, I like to build retry mechanisms into my systems: I’ll wrap an operation in ‘retryable’ logic, which lets you specify the number of retries. Typically, the wait between retries has quadratic backoff: before the i-th retry it waits i*i seconds, for i = 1..n, where n is the maximum number of retries. Alternatively, use fib(i), so you get something like 1, 1, 2, 3, 5 second waits. The backoff avoids putting unnecessary strain on the upstream resource while it is struggling.
If the operation still fails after the set number of retries, you can either throw an exception (which can be caught to inform the user or other modules of the error) or just log the failure, depending on the severity.
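For what it's worth, here is a minimal Python sketch of the kind of wrapper I mean. The names (`retryable`, `quadratic_delays`, `fibonacci_delays`) and the choice of which exceptions count as retryable are my own assumptions for illustration, not any particular library's API, so adapt them to whatever your database driver actually raises.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def quadratic_delays(retries: int) -> Iterator[int]:
    """Yield i*i seconds for i = 1..retries (1, 4, 9, ...)."""
    for i in range(1, retries + 1):
        yield i * i


def fibonacci_delays(retries: int) -> Iterator[int]:
    """Yield the first `retries` Fibonacci numbers (1, 1, 2, 3, 5, ...)."""
    a, b = 1, 1
    for _ in range(retries):
        yield a
        a, b = b, a + b


def retryable(
    operation: Callable[[], T],
    delays: Iterable[float],
    # Assumption: these built-in exceptions stand in for "network trouble".
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying once per entry in `delays` on failure."""
    schedule = iter(delays)
    while True:
        try:
            return operation()
        except retry_on:
            delay = next(schedule, None)
            if delay is None:
                # Retries exhausted: re-raise so the caller can report or log it.
                raise
            sleep(delay)
```

You would call it as something like `retryable(connect_to_db, fibonacci_delays(5))`, where `connect_to_db` is a placeholder for your own function. Passing `sleep` as a parameter is just a convenience so the waiting can be stubbed out in tests.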