I would like to normalize data in a DataTable (insertRows) that has no key. To do that, I need to identify and mark duplicate records by their ID (import_id); afterwards I will select only the distinct ones. The approach I am considering is to compare each row against all the other rows in that DataTable.
The columns in the DataTable are not known at design time, and there is no key. Performance-wise, the table could have as many as 10k to 20k records and about 40 columns.
How do I accomplish this without sacrificing performance too much?
I attempted to use LINQ, but I did not know how to dynamically specify the where criteria. Here I am comparing first and last names in a loop for each row:

    foreach (System.Data.DataRow lrows in importDataTable.Rows)
    {
        IEnumerable<System.Data.DataRow> insertRows = importDataTable.Rows.Cast<System.Data.DataRow>();
        var col_matches =
            from irows in insertRows
            where String.Compare(irows["fname"].ToString(), lrows["fname"].ToString(), true) == 0
               && String.Compare(irows["last_name"].ToString(), lrows["last_name"].ToString(), true) == 0
            select new { import_id = irows["import_id"].ToString() };
    }
Any ideas are welcome. (See also my similar question: How do I find similar column names using LINQ?)
The easiest way to get this done without O(n²) complexity is to use a data structure that efficiently implements set operations, specifically a Contains operation. Fortunately, .NET (as of 3.5) includes HashSet<T>, which does exactly this. In order to make use of it, you need a single object that encapsulates a row of your DataTable.
If DataRow won’t work as that object, I recommend converting the relevant column values to strings, concatenating them with a delimiter that cannot occur in the data (so that "ab" + "c" and "a" + "bc" do not collide), and placing the result in the HashSet. Before you insert a row, check whether the HashSet already contains its key (using Contains, or simply the return value of Add). If it does, you’ve found a duplicate.
Edit:
This method is O(n), since each HashSet lookup and insert is O(1) on average.
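A minimal sketch of this approach, assuming all columns participate in the duplicate check (the sample table, column names, and the "\u0001" delimiter are illustrative, not from the original post):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

class DedupSketch
{
    // Builds a composite key from every column of a row, joined with a
    // delimiter that is assumed never to appear in the data, so that
    // adjacent values cannot run together and collide.
    static string RowKey(DataRow row)
    {
        var parts = new string[row.Table.Columns.Count];
        for (int i = 0; i < parts.Length; i++)
            parts[i] = Convert.ToString(row[i]);
        return string.Join("\u0001", parts);
    }

    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("fname");
        table.Columns.Add("last_name");
        table.Rows.Add("Ada", "Lovelace");
        table.Rows.Add("Alan", "Turing");
        table.Rows.Add("Ada", "Lovelace"); // duplicate

        // OrdinalIgnoreCase mirrors the case-insensitive String.Compare
        // used in the question's LINQ attempt.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (DataRow row in table.Rows)
        {
            // Add returns false when the key is already present,
            // i.e. the row is a duplicate.
            if (!seen.Add(RowKey(row)))
                Console.WriteLine("Duplicate: " + row["fname"] + " " + row["last_name"]);
        }
    }
}
```

Because the key is built by iterating the table's Columns collection, the same code works no matter which columns exist at run time.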