I have a CSV file with several thousand customers’ details. I would like to extract duplicated customers, meaning those that share the same values for selected headers.
For example, I would like to extract all customers where there exists more than one record with the same ‘surname’ and ‘postcode’.
"surname","postcode","other-stuff-that-doesn't-matter"...
"smith", "AB1 2CD", "dxfh"...
"smith", "AB1 2CD", "98sf"...
"jones", "BC2 3DE", "as0j"...
"jones", "BC2 3DE", "9as6"...
"blogs", "BC2 3DE", "9as6"...
Based on the above, the program would return a new CSV like so:
"surname","postcode","other-stuff-that-doesn't-matter"...
"smith", "AB1 2CD", "dxfh"...
"smith", "AB1 2CD", "98sf"...
"jones", "BC2 3DE", "as0j"...
"jones", "BC2 3DE", "9as6"...
EDIT
Thanks for the help so far. I think I have a working solution, but I’m interested to know if this can be optimised (I’m sure it can!).
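require 'csv'
require 'set'

set_one = Set.new        # keys seen at least once
set_two = Set.new        # keys seen more than once
duplicates = Array.new
headers = nil

# First pass: work out which [surname, postcode] pairs occur more than once.
# Note: :post_code matches a header such as "Post Code" or "post_code";
# for the sample header "postcode" shown above, the key would be :postcode.
CSV.foreach('customers.csv', :headers => true, :header_converters => :symbol) do |row|
  headers = row.headers unless headers
  values = [row[:surname], row[:post_code]]
  if set_one.include? values
    set_two << values
  else
    set_one << values
  end
end

# Second pass: collect every row whose key was seen more than once.
CSV.foreach('customers.csv', :headers => true, :header_converters => :symbol) do |row|
  values = [row[:surname], row[:post_code]]
  if set_two.include? values
    duplicates << row
  end
end

# Write the duplicated rows to a new file.
CSV.open("duplicate-customers.csv", "wb") do |csv|
  csv << headers
  duplicates.each { |dupe| csv << dupe }
end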
Let’s read in the CSV first (this doesn’t handle escapes or quotes; it’s just an example):
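A minimal sketch of such a naive reader, assuming plain comma-separated lines with no commas or escaped quotes inside fields. The names `read_rows` and `SAMPLE` are illustrative and not part of the original post.

```ruby
# Naive CSV reader: splits each line on commas and strips surrounding
# whitespace and double quotes. It does NOT handle commas inside quoted
# fields or escaped quotes.
def read_rows(text)
  text.each_line.map do |line|
    line.chomp.split(',').map { |field| field.strip.delete('"') }
  end
end

# Sample data taken from the question (third column name shortened).
SAMPLE = <<~TEXT
  "surname","postcode","other"
  "smith", "AB1 2CD", "dxfh"
  "smith", "AB1 2CD", "98sf"
  "jones", "BC2 3DE", "as0j"
  "jones", "BC2 3DE", "9as6"
  "blogs", "BC2 3DE", "9as6"
TEXT

header, *rows = read_rows(SAMPLE)
puts header.inspect
puts rows.length
```

For real data, Ruby's standard `CSV` library is the safer choice, since it handles quoting and embedded commas.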
Ok, what do we do with the data? It looks like:
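Assuming the rows were read into an array of arrays with the header removed, the sample data from the question would be shaped roughly like this (the constant name `ROWS` is an illustrative choice):

```ruby
# Parsed rows as an array of arrays: [surname, postcode, other]
ROWS = [
  ['smith', 'AB1 2CD', 'dxfh'],
  ['smith', 'AB1 2CD', '98sf'],
  ['jones', 'BC2 3DE', 'as0j'],
  ['jones', 'BC2 3DE', '9as6'],
  ['blogs', 'BC2 3DE', '9as6']
].freeze

# The fields that identify a customer are the first two columns.
ROWS.each { |surname, postcode, _other| puts "#{surname} / #{postcode}" }
```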
How about something like:
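One possible brute-force version, sketched under the assumption that each row is an array with the surname at index 0 and the postcode at index 1. The method name `duplicates_slow` is hypothetical.

```ruby
# Brute force: for every row, scan all rows and count how many share the
# same surname (index 0) and postcode (index 1). Quadratic in the row count.
def duplicates_slow(rows)
  rows.select do |row|
    rows.count { |other| other[0] == row[0] && other[1] == row[1] } > 1
  end
end

rows = [
  ['smith', 'AB1 2CD', 'dxfh'],
  ['smith', 'AB1 2CD', '98sf'],
  ['blogs', 'BC2 3DE', '9as6']
]
duplicates_slow(rows).each { |row| puts row.join(',') }
```

Every row is compared against every other row, so the work grows with the square of the number of rows.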
OK, that’s slow and not clever. Let’s proceed to solution 2: generate a hash key from the fields we want to check for uniqueness:
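A sketch of the key-based idea, assuming rows are arrays with surname and postcode in the first two positions and that the `|` separator never appears in those fields. The method name `duplicates_by_key` is illustrative.

```ruby
# Build a string key from the identifying fields, group rows by that key,
# and keep only the groups that contain more than one row.
def duplicates_by_key(rows)
  groups = rows.group_by { |row| row.values_at(0, 1).join('|') }
  groups.values.select { |group| group.size > 1 }.flatten(1)
end

rows = [
  ['smith', 'AB1 2CD', 'dxfh'],
  ['smith', 'AB1 2CD', '98sf'],
  ['blogs', 'BC2 3DE', '9as6']
]
duplicates_by_key(rows).each { |row| puts row.join(',') }
```

Each row is visited once while grouping, so this scales roughly linearly with the number of rows instead of quadratically.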
Far from optimal, but just thinking out loud here. If it’s a “one time only” kind of thing and you just need to parse a file, go with what you can come up with.
What you probably want to do is store this identifying string in a hash (or an array) while reading the CSV, check whether the current row matches any of the stored keys, and do something if it does.
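The single-pass idea could look something like the following sketch, which uses Ruby's standard `CSV` library. The method name `find_duplicates` and its parameters are assumptions made for illustration, and the input is passed as a string so the example stays self-contained.

```ruby
require 'csv'

# Reads the CSV once. The first row seen for each key is held back; as soon
# as a second row with the same key appears, both are collected.
def find_duplicates(csv_text, key_columns)
  seen = {}    # key => first row with that key, or nil once it has been collected
  dupes = []
  CSV.parse(csv_text, headers: true) do |row|
    key = key_columns.map { |column| row[column] }
    if seen.key?(key)
      pending = seen[key]
      dupes << pending if pending   # collect the first occurrence only once
      seen[key] = nil
      dupes << row
    else
      seen[key] = row
    end
  end
  dupes
end

text = <<~TEXT
  surname,postcode,other
  smith,AB1 2CD,dxfh
  smith,AB1 2CD,98sf
  blogs,BC2 3DE,9as6
TEXT

find_duplicates(text, %w[surname postcode]).each { |row| puts row.to_csv }
```

Compared with the two-pass version in the question, this reads the data only once, at the cost of keeping the first row for each distinct key in memory until a match turns up.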