I have two processes that work with data in the same table.
One process inserts daily, one by one (pure ADO.NET), about 20000 records in the target table.
The second process calls ( periodically, every 15 minutes ) a stored procedure that
- Detects the duplicates in those 20000 records by looking at all the records 7 days back and marks them as such.
- Marks all records that are not duplicates with a ‘ToBeCopied’ flag.
- Select a number of columns from the records marked as ‘ToBeCopied’ and returns the set.
Sometimes these two processes overlap ( due to delays in data processing ) and I am suspecting that if the first process inserts new records when second process is somewhere between 1 and 2 then records will be marked ‘ToBeCopied’ without having gone through the duplicate sifting.
This means that now the store procedure is returning some duplicates.
This is my theory but In practice I have not been able to replicate it…
I am using LINQ to SQL to insert duplicates (40-50 or so a second) and while this is running I am manually calling the stored procedure and store its results.
It appears that when the stored procedure is running the inserting pauses … such that at the end no duplicates have made it to the final result set.
I am wondering if LINQ to SQL or SQL Server has a default mechanism that prevents concurrency and is pausing the inserting while the selecting or updating takes place.
What do you think?
EDIT 1:
The ‘duplicates’ are not identical rows. They are ‘equivalent’ given the business/logical entities these records represent. Each row has a unique primary key.
P.S. Selecting the result set takes place with NOLOCK. Trying to reproduce on SQL Server 2008. Problem is alleged to occur on SQL Server 2005.
What do I think?
Update
Consider this for instance:
A staging table with a structure like:
Process
Ais doing inserts at leisure into this table. It doe snot do any validation, it just dumps raw records in:Process
Bis tasked with extracting this staging table and moving cleaned up data into a table T2. It has to remove duplicates that, by business rules, mean records with same values in data1, data2 and data3. Within a set of duplicates, only the first record bydateshould be kept:Assuming Process
Aonly inserts, this procedure would safely process the staging table and extract the cleaned duplicates. Of course, this is just a skeleton, a true ETL process would have error handling via BEGIN TRY/BEGIN CATCH and transaction log size control via batching.