In Django, I have run into a serious race condition. The trouble starts when two runners try to execute some_method() at the same time. The log output looks like this:
Job 3: Candidate
Job 3: Already taken
Job 3: Candidate
Job 3: Already taken
Job 3: Candidate
Job 3: Already taken
(et cetera for 18 MB)
The following method is giving me trouble. Note that the method is re-run until it returns False:
def some_method():
    conditions = #(amongst others, excludes jobs with status EXECUTING)
    try:
        cjob = Job.objects.filter(conditions).order_by(some_fields)[0]
    except IndexError:
        return False
    print 'Job %s: Candidate' % cjob.id
    job = cjob.for_update()
    if cjob.status != job.status:
        print 'Job %s: Already taken' % cjob.id
        return True
    print 'Job %s: Starting...' % job.id
    job.status = Job.EXECUTING
    job.save()
    # Critical section

# In models.py:
class Job(models.Model):
    # ...
    def for_update(self):
        return Job.objects.raw('SELECT * FROM `backend_job` WHERE `id` = %s FOR UPDATE', (self.id, ))[0]
Currently, Django doesn’t have a dedicated for_update method. To avoid building the FOR UPDATE query with all the conditions we use to determine whether a job must be run, we run the complex query first and the simple FOR UPDATE query second.
I don’t really see how this could cause the trouble we see. We do the query, followed by a statement that blocks while another runner holds the lock on the job. The lock is only released after the job’s status has been changed. The second runner then gets the lock, but the job’s status has changed, so it returns from the method, only to re-enter it later. At that point the cjob query should not return the same job again, as its status is now excluded by the filter.
Am I misinterpreting the FOR UPDATE clause, or am I missing something else?
Note that I use MySQL with InnoDB and that Celery is not a suitable option here.
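The problem has been fixed by manually committing the transaction. It seems that the QuerySet’s results were not refreshed after the start of the transaction. This matches InnoDB’s default REPEATABLE READ isolation level, where a plain SELECT reads from the snapshot taken at the transaction’s first read, while SELECT ... FOR UPDATE reads the latest committed row. When two runners’ queries started at about the same time and the same job appeared in both result sets, one runner kept seeing the job as a candidate even after the other had taken it.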
After reading this answer, I came up with a solution: just before the return True, the transaction is committed.