The Requirements
I have the following table (pseudo DDL):
CREATE TABLE MESSAGE (
    MESSAGE_GUID GUID PRIMARY KEY,
    INSERT_TIME DATETIME
);
CREATE INDEX MESSAGE_IE1 ON MESSAGE (INSERT_TIME);
Several clients concurrently insert rows into that table, possibly many times per second. I need to design a “Monitor” application that will:
1. Initially, fetch all the rows currently in the table.
2. After that, periodically check if there are any new rows inserted and then fetch these rows only.
There may be multiple Monitors concurrently running. All the Monitors need to see all the rows (i.e. when a row is inserted, it must be “detected” by all the currently running Monitors).
This application will be developed for Oracle initially, but we need to keep it portable to every major RDBMS and would like to avoid as much database-specific stuff as possible.
The Problem
The naive solution would be to simply find the maximal INSERT_TIME in rows selected in step 1 and then…
SELECT * FROM MESSAGE WHERE INSERT_TIME >= :max_insert_time_from_previous_select
…in step 2.
However, I’m worried this might lead to race conditions. Consider the following scenario:
1. Transaction A inserts a new row but does not yet commit.
2. Transaction B inserts a new row and commits.
3. The Monitor selects rows and sees that the maximal INSERT_TIME is the one inserted by B.
4. Transaction A commits. At this point, A’s INSERT_TIME is actually earlier than B’s (A’s INSERT was executed before B’s, before we even knew who was going to commit first).
5. The Monitor selects rows newer than B’s INSERT_TIME (as a consequence of step 3). Since A’s INSERT_TIME is earlier than B’s, A’s row is skipped.
So, the row inserted by transaction A is never fetched.
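The scenario above can be reproduced without a database. The following is only a minimal simulation, assuming integer clock ticks and a hypothetical `Row` record that stores both the time the INSERT ran and the time the transaction committed; a row is treated as visible to the Monitor only from its commit time onward.

```python
from dataclasses import dataclass


@dataclass
class Row:
    guid: str
    insert_time: int  # when the INSERT statement ran (value of INSERT_TIME)
    commit_time: int  # when the row became visible to other sessions


def naive_poll(rows, now, last_max):
    """Naive query: committed rows with INSERT_TIME >= last_max."""
    visible = [r for r in rows if r.commit_time <= now]
    return [r for r in visible if r.insert_time >= last_max]


def run_naive_monitor(rows, poll_times):
    """Poll at each time in poll_times and return the GUIDs ever fetched."""
    seen = set()
    last_max = 0
    for now in poll_times:
        batch = naive_poll(rows, now, last_max)
        seen.update(r.guid for r in batch)
        if batch:
            last_max = max(r.insert_time for r in batch)
    return seen


# A inserts at t=1 but commits at t=4; B inserts at t=2 and commits at t=3.
rows = [
    Row("A", insert_time=1, commit_time=4),
    Row("B", insert_time=2, commit_time=3),
]

# The Monitor polls at t=3 (sees only B) and again at t=5.
missed = {r.guid for r in rows} - run_naive_monitor(rows, poll_times=[3, 5])
print(missed)  # prints {'A'}
```

The second poll filters on `INSERT_TIME >= 2`, so the row from A (INSERT_TIME 1) is never returned even though it is committed by then.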
Any ideas on how to design the client SQL, or even change the database schema (as long as it stays reasonably portable), so that these kinds of concurrency problems are avoided while still keeping decent performance?
Thanks.
Without using any of the platform-specific change data capture (CDC) technologies, there are a few approaches.
Option 1
Each Monitor registers a sort of subscription to the MESSAGE table. The code that writes messages then writes each MESSAGE once per Monitor. Each Monitor then deletes the message from its subscription once it has been processed.
Option 2
Each Monitor maintains a cache of the recent messages it has processed, covering a period at least as long as the longest-running transaction could be. If the Monitor maintained a cache of the messages it has processed for the last 5 minutes, for example, it would query your MESSAGE table for all messages later than its LAST_MONITOR_TIME minus 5 minutes. The Monitor would then be responsible for noting that some of the rows it had selected had already been processed. The Monitor would only process MESSAGE_GUID values that were not in its cache.
Option 3
Just like Option 1, you set up subscriptions for each Monitor but you use some queuing technology to deliver the messages to the Monitor. This is less portable than the other two options but most databases can deliver messages to applications via queues of some sort (e.g. JMS queues if your Monitor is a Java application). This saves you from reinventing the wheel by building your own queue table and gives you a standard interface in the application tier to code against.
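Returning to Option 2, the overlap-and-deduplicate idea can be sketched as follows. This is a rough illustration under several assumptions: the `Monitor` class and its `overlap_seconds` parameter are hypothetical, INSERT_TIME is stored as integer seconds to keep the arithmetic simple, SQLite stands in for the real database, and the overlap window is assumed to be longer than any inserting transaction can stay open.

```python
import sqlite3


class Monitor:
    """Polls MESSAGE with a look-back window and skips rows already processed."""

    def __init__(self, conn, overlap_seconds):
        self.conn = conn
        self.overlap = overlap_seconds
        self.last_monitor_time = None  # highest INSERT_TIME seen so far
        self.cache = {}                # MESSAGE_GUID -> INSERT_TIME

    def poll(self):
        if self.last_monitor_time is None:
            # First poll: fetch everything currently in the table.
            rows = self.conn.execute(
                "SELECT MESSAGE_GUID, INSERT_TIME FROM MESSAGE ORDER BY INSERT_TIME"
            ).fetchall()
        else:
            # Later polls: re-read the overlap window to catch late commits.
            cutoff = self.last_monitor_time - self.overlap
            rows = self.conn.execute(
                "SELECT MESSAGE_GUID, INSERT_TIME FROM MESSAGE "
                "WHERE INSERT_TIME >= ? ORDER BY INSERT_TIME",
                (cutoff,),
            ).fetchall()

        # Only rows that are not in the cache are new to this Monitor.
        new = [(guid, t) for guid, t in rows if guid not in self.cache]
        for guid, t in new:
            self.cache[guid] = t

        if rows:
            newest = rows[-1][1]  # rows are ordered by INSERT_TIME
            if self.last_monitor_time is None or newest > self.last_monitor_time:
                self.last_monitor_time = newest

        # Evict cache entries that can no longer be returned by a later poll.
        if self.last_monitor_time is not None:
            cutoff = self.last_monitor_time - self.overlap
            self.cache = {g: t for g, t in self.cache.items() if t >= cutoff}

        return [guid for guid, _ in new]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MESSAGE (MESSAGE_GUID TEXT PRIMARY KEY, INSERT_TIME INTEGER)")
conn.execute("CREATE INDEX MESSAGE_IE1 ON MESSAGE (INSERT_TIME)")

monitor = Monitor(conn, overlap_seconds=300)
conn.execute("INSERT INTO MESSAGE VALUES ('B', 102)")
first = monitor.poll()
# A's row becomes visible late, with an INSERT_TIME earlier than B's.
conn.execute("INSERT INTO MESSAGE VALUES ('A', 101)")
second = monitor.poll()
third = monitor.poll()
print(first, second, third)  # prints ['B'] ['A'] []
```

The cost is that every poll re-reads the last few minutes of rows, so the window should be as short as the longest expected transaction allows.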