I have to update more than 5 million records in a database table, T1. I am writing a C# tool that will read (SELECT) a column in T1, say T1.col1, extract a value from it using some logic, and then UPDATE another column, T1.col2, in the same table with the processed value.
I would like some opinions on the best/most optimised way to achieve this in C# / ADO.NET.
NOTE: The extraction logic cannot be part of the SQL. It is embedded in a COM DLL that I call through interop from .NET and apply to the value of T1.col1 to generate a new value, which finally has to be saved in T1.col2.
Since you need to transfer the data to the client so that a COM object can process it, this is what I would do:
Use a machine with lots of memory. Load the data into memory in chunks (for example 5000 or 50000 rows at a time), process it, and run the updates against the SQL Server.
For the UPDATE part, use transactions and put 5000 to 20000 UPDATEs into one transaction.
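A rough sketch of that read/process/update loop, not a definitive implementation. It assumes T1 has an integer key column called id and that col1 and col2 are string columns; neither is stated in the question. The extract delegate stands in for the call into the COM DLL, and all class and method names here are made up for illustration. Only the provider-neutral System.Data interfaces are used, so any ADO.NET connection should fit; the TOP (@n) syntax is SQL Server specific.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using Row = System.Collections.Generic.KeyValuePair<long, string>;

public static class ChunkedUpdater
{
    // Hypothetical schema: T1(id, col1, col2). The key column "id" is an assumption.
    const string SelectSql =
        "SELECT TOP (@n) id, col1 FROM T1 WHERE id > @lastId ORDER BY id";
    const string UpdateSql = "UPDATE T1 SET col2 = @val WHERE id = @id";

    static IDbDataParameter AddParam(IDbCommand cmd, string name, DbType type, object value)
    {
        IDbDataParameter p = cmd.CreateParameter();
        p.ParameterName = name;
        p.DbType = type;
        p.Value = value ?? DBNull.Value;
        cmd.Parameters.Add(p);
        return p;
    }

    // Loads up to chunkSize rows with id > lastId into memory, ordered by id.
    public static List<Row> ReadChunk(IDbConnection conn, long lastId, int chunkSize)
    {
        var rows = new List<Row>(chunkSize);
        using (IDbCommand cmd = conn.CreateCommand())
        {
            cmd.CommandText = SelectSql;
            AddParam(cmd, "@n", DbType.Int32, chunkSize);
            AddParam(cmd, "@lastId", DbType.Int64, lastId);
            using (IDataReader reader = cmd.ExecuteReader())
                while (reader.Read())
                    rows.Add(new Row(reader.GetInt64(0),
                                     reader.IsDBNull(1) ? null : reader.GetString(1)));
        }
        return rows;
    }

    // Writes one chunk of (id, new col2 value) pairs inside a single transaction.
    public static void WriteChunk(IDbConnection conn, List<Row> updates)
    {
        using (IDbTransaction tx = conn.BeginTransaction())
        using (IDbCommand cmd = conn.CreateCommand())
        {
            cmd.Transaction = tx;
            cmd.CommandText = UpdateSql;
            IDbDataParameter val = AddParam(cmd, "@val", DbType.String, null);
            IDbDataParameter id = AddParam(cmd, "@id", DbType.Int64, 0L);
            try
            {
                foreach (Row u in updates)
                {
                    id.Value = u.Key;
                    val.Value = (object)u.Value ?? DBNull.Value;
                    cmd.ExecuteNonQuery();
                }
                tx.Commit();
            }
            catch
            {
                tx.Rollback();
                throw;
            }
        }
    }

    // Driver loop: read a chunk, run the extraction, write the chunk, repeat.
    // readChunk receives the last id already processed; returns the row count handled.
    public static long Run(Func<long, List<Row>> readChunk,
                           Func<string, string> extract,
                           Action<List<Row>> writeChunk)
    {
        long lastId = long.MinValue, total = 0;
        while (true)
        {
            List<Row> rows = readChunk(lastId);
            if (rows.Count == 0) return total;
            var updates = new List<Row>(rows.Count);
            foreach (Row row in rows)
                updates.Add(new Row(row.Key, extract(row.Value)));
            writeChunk(updates);
            lastId = rows[rows.Count - 1].Key;
            total += rows.Count;
        }
    }
}
```

With an open connection conn you would wire it up roughly as Run(last => ReadChunk(conn, last, 5000), yourComCall, u => WriteChunk(conn, u)). Paging by "id > lastId" keeps each SELECT cheap as long as id is indexed, and the reader is closed before the transaction starts, so a single connection is enough.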
[EDIT]: By partitioning the work properly and assigning 500000 or 1000000 rows to each “worker machine”, you can speed this up to the maximum throughput of your SQL Server. [/EDIT]
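One possible way to split the work, sketched under the assumption that T1 has a numeric key column (called id here, which is not in the question) whose minimum and maximum you have looked up beforehand. The class and method names are hypothetical. Each worker would then restrict its SELECT and UPDATE statements to its own range, e.g. WHERE id BETWEEN @start AND @end.

```csharp
using System;
using System.Collections.Generic;

public static class WorkPartitioner
{
    // Splits the inclusive key range [minId, maxId] into consecutive inclusive
    // ranges that each span at most idsPerWorker key values.
    // Key = first id of the range, Value = last id of the range.
    public static List<KeyValuePair<long, long>> Partition(long minId, long maxId, long idsPerWorker)
    {
        if (idsPerWorker <= 0) throw new ArgumentOutOfRangeException(nameof(idsPerWorker));
        var ranges = new List<KeyValuePair<long, long>>();
        long start = minId;
        while (start <= maxId)
        {
            // Assumes ids are nowhere near long.MaxValue, so this cannot overflow.
            long end = Math.Min(maxId, start + idsPerWorker - 1);
            ranges.Add(new KeyValuePair<long, long>(start, end));
            if (end == maxId) break;
            start = end + 1;
        }
        return ranges;
    }
}
```

For ids 1 to 5000000 and 1000000 ids per worker this gives five ranges, the first being 1 to 1000000. If the key has gaps, the ranges will hold uneven numbers of rows, so treat the split as approximate.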
Another option, though not recommended (only because of the theoretically possible security and/or stability issues introduced by the COM object in this specific case):
(This description is for SQL Server, but something similar is possible with Oracle on Windows too.)
You can put the transformation logic into your SQL Server by writing and installing a .NET assembly that exposes a stored procedure you can call to do the transformation. The .NET assembly in turn accesses the COM object. For a how-to, see http://www.sqlteam.com/article/writing-clr-stored-procedures-in-charp-introduction-to-charp-part-1
The MSDN reference for this is http://msdn.microsoft.com/en-us/library/ms131094.aspx