I am attempting to remove duplicates from a .NET DataTable with more than 50,000 rows. My approach is simple: sort the DataTable alphabetically, then scan through it looking for rows that are the same as the row above them.
The problem I’m having is that the datatable “wraps” around when sorted. I use this to sort it:
myDataTable.DefaultView.Sort = "name";
When I view the datatable using the debugger, it is sorted alphabetically in chunks, like so:
Aardvark
Apple
Banana
...(20,000 rows later)...
Aardvark
Angle
Boat
Obviously this ruins my attempt to find duplicates. Is this some sort of optimization behavior of the framework when dealing with large tables? What is going on here?
Solution:
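Here is what I was doing:
myDataTable.DefaultView.Sort = "name";
// DefaultView.Table is the original, unsorted table, so the sort above is ignored here.
for (int i = 1; i < myDataTable.DefaultView.Table.Rows.Count; i++) // start at 1 so that i - 1 is a valid index
{
    var thisRow = myDataTable.DefaultView.Table.Rows[i];
    var prevRow = myDataTable.DefaultView.Table.Rows[i - 1];
}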
Here is what I should have been doing:
myDataTable.DefaultView.Sort = "name";
// ToTable() copies the rows into a new DataTable in the view's sorted order.
var myNewDatatable = myDataTable.DefaultView.ToTable();
for (int i = 1; i < myNewDatatable.Rows.Count; i++) // start at 1 so that i - 1 is a valid index
{
    var thisRow = myNewDatatable.Rows[i];
    var prevRow = myNewDatatable.Rows[i - 1];
}
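For completeness, here is a minimal sketch of how the scan above might go on to actually drop the duplicates. It assumes that "duplicate" means an identical value in the "name" column only; the class name DedupSketch and the method RemoveDuplicateNames are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Data;

public static class DedupSketch
{
    // Hypothetical helper: returns a sorted copy of the table with
    // adjacent rows that share the same "name" value removed.
    public static DataTable RemoveDuplicateNames(DataTable source)
    {
        source.DefaultView.Sort = "name";
        DataTable sorted = source.DefaultView.ToTable();

        // Walk backwards so that removing a row does not shift
        // the indexes of rows that have not been visited yet.
        for (int i = sorted.Rows.Count - 1; i >= 1; i--)
        {
            object thisName = sorted.Rows[i]["name"];
            object prevName = sorted.Rows[i - 1]["name"];
            if (object.Equals(thisName, prevName))
                sorted.Rows.RemoveAt(i);
        }
        return sorted;
    }

    public static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("name", typeof(string));
        foreach (var n in new[] { "Banana", "Aardvark", "Apple", "Aardvark" })
            table.Rows.Add(n);

        DataTable deduped = RemoveDuplicateNames(table);
        foreach (DataRow row in deduped.Rows)
            Console.WriteLine(row["name"]); // Aardvark, Apple, Banana (one per line)
    }
}
```

Note that the comparison here is an exact string match, while the view's sort is case-insensitive by default, so "apple" and "Apple" would sort next to each other but not be treated as duplicates.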
Here you're sorting the DataView for the DataTable and not the DataTable itself. So you have to either use the DataView (myDataTable.DefaultView) or get the DataRows of the DataTable sorted by name.
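The two options could look roughly like the sketch below. The first reads rows through the DataView indexer, which follows the sort order; the second uses DataTable.Select with a sort expression, which is my assumption about what "getting the DataRows sorted by name" refers to. The class and method names (SortedScanSketch, FindDuplicateNames, SortedRows) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class SortedScanSketch
{
    // Option 1: iterate the DataView itself. view[i] is a DataRowView
    // in sorted order, unlike table.Rows[i], which keeps insertion order.
    public static List<string> FindDuplicateNames(DataTable table)
    {
        var duplicates = new List<string>();
        DataView view = table.DefaultView;
        view.Sort = "name";

        for (int i = 1; i < view.Count; i++)
        {
            string current = Convert.ToString(view[i]["name"]);
            string previous = Convert.ToString(view[i - 1]["name"]);
            if (string.Equals(current, previous))
                duplicates.Add(current);
        }
        return duplicates;
    }

    // Option 2: ask the table for its rows sorted by "name".
    // An empty filter expression selects every row.
    public static DataRow[] SortedRows(DataTable table)
    {
        return table.Select("", "name");
    }

    public static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("name", typeof(string));
        foreach (var n in new[] { "Banana", "Aardvark", "Apple", "Aardvark" })
            table.Rows.Add(n);

        Console.WriteLine(string.Join(", ", FindDuplicateNames(table))); // Aardvark
        Console.WriteLine(SortedRows(table)[0]["name"]);                 // Aardvark
    }
}
```

Both options avoid copying the table, which may matter with 50,000+ rows; ToTable() is the simpler choice when a sorted copy is acceptable.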