When a Heroku worker is restarted (either on command or as the result of a deploy), Heroku sends SIGTERM to the worker process. In the case of delayed_job, the SIGTERM signal is caught and then the worker stops executing after the current job (if any) has stopped.
If the worker takes too long to finish, Heroku will send SIGKILL. In the case of delayed_job, this leaves a locked job in the database that won’t get picked up by another worker.
I’d like to ensure that jobs eventually finish (unless there’s an error). Given that, what’s the best way to approach this?
I see two options, but I’d like to get other input:
- Modify `delayed_job` to stop working on the current job (and release the lock) when it receives a `SIGTERM`.
- Figure out a (programmatic) way to detect orphaned locked jobs and then unlock them.
Any thoughts?
TLDR:
Put this at the top of your job method:
AND
Make sure this is called at least once every ten seconds:
AND
At the end of your method, put this:
Explanation:
I was having the same problem and came upon this Heroku article.
The job contained an outer loop, so I followed the article and added a `trap('TERM')` and `exit`. However, `delayed_job` picks that up as `failed with SystemExit` and marks the task as failed.
With the `SIGTERM` now trapped by our `trap`, the worker’s handler isn’t called; instead it immediately restarts the job and then gets `SIGKILL` a few seconds later. Back to square one.
I tried a few alternatives to `exit`:
- A `return true` marks the job as successful (and removes it from the queue), but suffers from the same problem if there’s another job waiting in the queue.
- Calling `exit!` will successfully exit the job and the worker, but it doesn’t allow the worker to remove the job from the queue, so you still have the ‘orphaned locked jobs’ problem.
My final solution was the one given at the top of my answer. It comprises three parts:
- Before we start the potentially long job, we add a new interrupt handler for `'TERM'` by doing a `trap` (as described in the Heroku article), and we use it to set `term_now = true`. But we must also grab the `old_term_handler` that the delayed job worker code set (it is returned by `trap`) and remember to `call` it.
- We still must ensure that we return control to `Delayed::Worker` with sufficient time for it to clean up and shut down, so we should check `term_now` at least once every ten seconds (ideally a little more often) and `return` if it is `true`. You can either `return true` or `return false` depending on whether you want the job to be considered successful or not.
- Finally, it is vital to remember to remove your handler and reinstall the `Delayed::Worker` one when you have finished. If you fail to do this, you will keep a dangling reference to the one we added, which can result in a memory leak if you add another one on top of that (for example, when the worker starts this job again).
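Putting the three parts together, a minimal sketch might look like the following. Only the trap / check / restore pattern comes from the description above. The class name `LongRunningJob`, its `items` argument, and the block that does the per-item work are hypothetical and only there to make the example runnable. The `ensure` block is also an assumption of this sketch: it restores the previous handler even if the job raises.

```ruby
# Hypothetical job class used only to illustrate the pattern.
class LongRunningJob
  # items: the units of work; work: a block called once per item.
  def initialize(items, &work)
    @items = items
    @work = work
  end

  def perform
    term_now = false

    # Part 1: install our own TERM handler and remember the previous one.
    # Inside a delayed_job worker, the previous one is the worker's handler.
    old_term_handler = trap('TERM') do
      term_now = true
      # Chain to the previous handler when it is callable. trap can also
      # return strings such as "DEFAULT", which cannot be called.
      old_term_handler.call if old_term_handler.respond_to?(:call)
    end

    begin
      @items.each do |item|
        # Part 2: check the flag often. Each iteration has to stay well
        # under ten seconds for this check to be reached in time.
        return true if term_now

        @work.call(item)
      end
      true
    ensure
      # Part 3: always put the previous handler back.
      trap('TERM', old_term_handler)
    end
  end
end
```

Returning `true` on termination treats the interrupted run as successful; return `false` instead if the job should be considered unsuccessful, as discussed above.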