I am using MySQL through its C API, but that shouldn’t be relevant.
My code must process the records in a table that match some criteria, and then update those records to flag them as processed. The rows in the table are modified, inserted, and deleted by another process I don’t control. I am afraid that, in the following sequence, the UPDATE might flag some records erroneously, since the set of matching records might have changed between step 1 and step 3.
SELECT * FROM myTable WHERE <CONDITION>; # step 1
<iterate over the selected set of lines. This may take some time.> # step 2
UPDATE myTable SET processed=1 WHERE <CONDITION>; # step 3
What’s the smart way to ensure that the UPDATE flags all the processed rows, and only those? A transaction doesn’t seem to fit the bill, as it doesn’t provide isolation of that sort: a recently modified record that was not in the originally selected set might still be targeted by the UPDATE statement. For the same reason, SELECT … FOR UPDATE doesn’t seem to help, though it sounds promising 🙂
The only way I can see is to use a temporary table to remember the set of rows to be processed, doing something like:
CREATE TEMPORARY TABLE workOrder (jobId INT(11));
INSERT INTO workOrder SELECT myID AS jobId FROM myTable WHERE <CONDITION>;
SELECT * FROM myTable WHERE myID IN (SELECT jobId FROM workOrder);
<iterate over the selected set of lines. This may take some time.>
UPDATE myTable SET processed=1 WHERE myID IN (SELECT jobId FROM workOrder);
DROP TABLE workOrder;
But this seems wasteful and not very efficient.
Is there anything smarter?
Many thanks from a SQL newbie.
I eventually solved this issue by using a column in that table that flags rows according to their status. This column lets me implement a simple state machine. Conceptually, I have two possible values for this status:
Now my algorithm does something like this:
This idea of having rows in several states can be extended to as many states as needed.
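For illustration, here is a minimal sketch of what such a status-flag state machine could look like. It uses Python's sqlite3 module as a stand-in for MySQL so that it can run on its own. The `status` and `payload` columns, the `IDLE`/`CLAIMED` values, the `process_batch` helper, and the use of `processed = 0` in place of <CONDITION> are all assumptions made up for the example, not the actual schema or algorithm.

```python
import sqlite3

# Hypothetical status values (assumed for this sketch only).
IDLE, CLAIMED = 0, 1

# Hypothetical schema: rows inserted by the other process get status = IDLE
# by default, so that process does not need to know about the column.
SCHEMA = """
CREATE TABLE myTable (
    myID      INTEGER PRIMARY KEY,
    payload   TEXT,
    processed INTEGER NOT NULL DEFAULT 0,
    status    INTEGER NOT NULL DEFAULT 0
)
"""

def process_batch(conn, handle_row):
    """Claim the matching rows, process them, then flag only the claimed rows."""
    # Step 1: claim the rows with a single UPDATE. Rows that start matching
    # after this point keep status = IDLE and wait for the next batch.
    conn.execute(
        "UPDATE myTable SET status = ? WHERE processed = 0 AND status = ?",
        (CLAIMED, IDLE),
    )
    conn.commit()

    # Step 2: read back the claimed rows and process them (may take a while).
    rows = conn.execute(
        "SELECT myID, payload FROM myTable WHERE status = ?", (CLAIMED,)
    ).fetchall()
    for row in rows:
        handle_row(row)

    # Step 3: flag the claimed rows as processed and release the claim.
    # The WHERE clause tests the claim, not the original condition.
    conn.execute(
        "UPDATE myTable SET processed = 1, status = ? WHERE status = ?",
        (IDLE, CLAIMED),
    )
    conn.commit()
    return [row[0] for row in rows]
```

The point of the design is that the final UPDATE no longer re-evaluates the original condition, so rows that appear or change during step 2 are not flagged by mistake. This only holds if the other process never writes to the `status` column, and a row that it deletes while claimed simply disappears from step 3.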