The class RunningJob has several methods that throw IOException (presumably when the connection to Hadoop fails). The one I’m looking at right now is isComplete().
What’s the proper way to handle such an error? Should I assume that the job is dead? Should I wait and try again? Simply letting my application die at this point is not an option, since it’s managing a number of jobs on Hadoop and elsewhere, and it needs to be as robust as possible.
My answer is a bit too long for a comment, so I’m sorry I’m not directly answering your question. I’m mostly talking from experience in my response.
If an exception gets thrown up to this level, you can pretty much assume the job is going to die. I’ve found that just trying again, or trying to automatically fix the problem in response to an exception being thrown, is futile. There is just too much that can go wrong.
Usually when a job that typically runs fine fails, there is something bad happening in the system that needs to be fixed. Perhaps the NameNode is dead, perhaps the switch went dumb, who knows. Unfortunately, these issues need attention by a human.
In my opinion, development effort is better spent on building some sort of alerting infrastructure (email, usually) that lets you know as soon as your job has failed… instead of accounting for a ton of corner cases.
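As a rough illustration of that idea, here is a minimal sketch that treats an IOException from a status check as "needs a human" and raises an alert instead of retrying or crashing. Everything in it is an assumption made for the example: StatusSource is a hypothetical stand-in that mirrors only the isComplete() signature from the question, and the alert callback is a placeholder for whatever notification channel you actually use (email or otherwise).

```java
import java.io.IOException;
import java.util.function.Consumer;

public class JobMonitor {

    /** Hypothetical stand-in mirroring RunningJob.isComplete(); not a Hadoop type. */
    public interface StatusSource {
        boolean isComplete() throws IOException;
    }

    /** What the managing application should record for this job. */
    public enum Outcome { RUNNING, COMPLETE, NEEDS_ATTENTION }

    /**
     * Checks the job once. On IOException it does not retry: it sends an
     * alert and reports NEEDS_ATTENTION, so the caller can keep managing
     * its other jobs while a human looks into the failure.
     */
    public static Outcome check(String jobName, StatusSource job, Consumer<String> alert) {
        try {
            return job.isComplete() ? Outcome.COMPLETE : Outcome.RUNNING;
        } catch (IOException e) {
            // Assumption: the alert callback delivers this message somewhere visible.
            alert.accept("Status check failed for job " + jobName + ": " + e.getMessage());
            return Outcome.NEEDS_ATTENTION;
        }
    }
}
```

With a real RunningJob you could presumably call this as JobMonitor.check("nightly-etl", runningJob::isComplete, msg -> sendEmail(msg)), where the job name and sendEmail are hypothetical. The point is that the exception is contained per job, so the managing application stays up and someone is told right away.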
Once you find some common issues with your cluster and jobs, you can start building exception handling into your applications. I don’t think it’s worth your time to account for everything up front.
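To make that concrete, here is a sketch of what targeted handling might look like once you have identified one specific recurring failure. It assumes, purely for illustration, that the known issue surfaces as a java.net.ConnectException (a subclass of IOException); what your cluster actually throws may well differ. StatusSource is a hypothetical stand-in for the isComplete() call, and the attempt count and pause are made-up parameters. Only the known failure is retried, and only a bounded number of times; any other IOException propagates immediately.

```java
import java.io.IOException;
import java.net.ConnectException;

public class KnownFailureHandling {

    /** Hypothetical stand-in mirroring RunningJob.isComplete(); not a Hadoop type. */
    public interface StatusSource {
        boolean isComplete() throws IOException;
    }

    /**
     * Retries only the one failure type assumed to be a known, transient issue.
     * Rethrows the last ConnectException once maxAttempts is exhausted.
     */
    public static boolean isCompleteWithKnownRetry(StatusSource job, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConnectException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.isComplete();
            } catch (ConnectException e) {
                // Known issue (by assumption): pause, then try again if attempts remain.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
            // Any other IOException is not caught here and reaches the caller at once.
        }
        throw last;
    }
}
```

Keeping the catch clause this narrow means unfamiliar failures still reach whatever alerting you have, rather than being hidden behind a generic retry loop.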