Let’s say I’ve got a SQL Server database table with X (> 1,000,000) records in it that need to be processed (get data, perform external action, update status in db) one-by-one by some worker processes (console apps, Windows services, Azure worker roles, etc.). I need to guarantee each row is only processed once. Ideally exclusivity would be guaranteed no matter how many machines/processes were spun up to process the rows. I’m mostly worried about two SELECTs grabbing the same rows simultaneously.
I know there are better datastores for queuing out there, but I don’t have that luxury for this project. I have ideas for accomplishing this, but I’m looking for more.
I’ve had this situation.
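Add an InProcess column to the table, default = 0. In the consumer process, claim a row with a single UPDATE that sets InProcess to an identifier for that machine, picking the row with a subquery.
Now that machine owns the row, and you can query its data without fear. Usually your next statement will be a SELECT that reads the row you just claimed.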
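A minimal sketch of the claim-then-read step, not the original code. The table name dbo.WorkQueue, the key column RowId, and the variable @WorkerId are all hypothetical, and the extra `AND InProcess = 0` check on the outer UPDATE is my own addition, meant to make a late-arriving worker update zero rows rather than overwrite another worker's claim.

```sql
-- Hypothetical schema: dbo.WorkQueue(RowId int PRIMARY KEY, InProcess int DEFAULT 0, ...)
DECLARE @WorkerId int = 42;  -- assumed: a unique, non-zero id for this machine/process

-- Claim the lowest-numbered unclaimed row.
UPDATE dbo.WorkQueue
SET InProcess = @WorkerId
WHERE RowId = (SELECT MIN(RowId)
               FROM dbo.WorkQueue
               WHERE InProcess = 0)
  AND InProcess = 0;  -- re-checked on the row being updated (added assumption)

-- Only read if the claim actually succeeded.
IF @@ROWCOUNT = 1
    SELECT *
    FROM dbo.WorkQueue
    WHERE RowId = (SELECT MAX(RowId)
                   FROM dbo.WorkQueue
                   WHERE InProcess = @WorkerId);
```

If the UPDATE touches zero rows, either the queue is empty or another worker won the race, and the consumer can simply try again.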
You’ll also have to add a Done flag of some kind to the row, so you can tell if the row was claimed but processing was incomplete.
Edit
The UPDATE gets an exclusive lock (see MSDN). I’m not sure if the SELECT in the subquery is allowed to be split from the UPDATE; if so, you’d have to put them in a transaction.
@Will A posts a link which suggests that beginning your batch with a particular statement will guarantee it, but I haven’t tried it.
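Separately from that suggestion, one commonly used SQL Server pattern avoids the split-SELECT worry by doing the pick and the claim in a single statement with locking hints. This is a sketch under assumptions: dbo.WorkQueue, RowId and @WorkerId are hypothetical names, and the session is assumed to run at the default READ COMMITTED isolation level, since READPAST is not permitted at every level.

```sql
-- Hypothetical schema: dbo.WorkQueue(RowId int PRIMARY KEY, InProcess int DEFAULT 0, ...)
DECLARE @WorkerId int = 42;  -- assumed: a unique, non-zero id for this machine/process

WITH NextRow AS (
    SELECT TOP (1) RowId, InProcess
    FROM dbo.WorkQueue WITH (UPDLOCK, READPAST, ROWLOCK)
    WHERE InProcess = 0
    ORDER BY RowId
)
UPDATE NextRow
SET InProcess = @WorkerId
OUTPUT inserted.RowId;  -- hands the claimed key straight back to the caller
```

UPDLOCK takes an update lock on the candidate row as it is read, and READPAST lets other workers skip rows that are already locked instead of waiting on them. An empty result set means there was nothing left to claim at that moment.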
@Martin Smith’s link also makes some good points, looking at the OUTPUT clause (added in SQL Server 2005).
One last edit
Very interesting exchange in the comments, I definitely learned a few things here. And that’s what SO is for, right?
Just for color: when I used this approach back in 2004, I had a bunch of web crawlers dumping URLs-to-search into a table, then pulling their next URL-to-crawl from that same table. Since the crawlers were attempting to attract malware, they were liable to crash at any moment.
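Crashes like that are where the Done flag earns its keep. Here is a rough sketch of how it might be used, with hypothetical names throughout (dbo.WorkQueue, RowId, Done, @WorkerId, @RowId); what to do with abandoned rows is a policy decision the sketch leaves open.

```sql
-- Hypothetical schema: dbo.WorkQueue(RowId int PRIMARY KEY, InProcess int DEFAULT 0,
--                                    Done bit DEFAULT 0, ...)
DECLARE @WorkerId int = 42;   -- assumed id of this worker
DECLARE @RowId    int = 1001; -- assumed key of the row this worker claimed earlier

-- After the external action succeeds, mark the row as finished.
UPDATE dbo.WorkQueue
SET Done = 1
WHERE RowId = @RowId
  AND InProcess = @WorkerId;

-- Rows that were claimed but never finished, e.g. because the worker crashed.
SELECT RowId, InProcess
FROM dbo.WorkQueue
WHERE InProcess <> 0
  AND Done = 0;
```

Putting such rows back in the queue (setting InProcess back to 0) risks processing a row twice if the worker had already performed its external action, so whether to retry them or flag them for manual review depends on how strict the process-once requirement is.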