I am tasked with comparing two large, unsorted .csv files based on columns 1 and 3.
Each file contains about 200k records. For the output, I need to know which records, based on columns 1 and 3, exist in the first file but not in the second. The files are quoted, comma-separated value files. The comparison on column 3 must ignore case.
Example File 1:
"id", "name", "email", "country"
"1233", "jake", "jake@mailinator.com", "USA"
"2345", "alison", "Alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"
File 2:
"id", "name", "email", "country"
"2345", "alison", "alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5690", "lina", "lina@mailinator.com", "Canada"
Desired output file:
"5678", "natalia", "natalia@mailinator.com", "USA"
Code examples would be very appreciated.
Try:
How it works:
1) I first create a composite key column by joining column 1 and column 3:
2) I sort both outputs:
3) I then use the join command to join on the first column (my composite key) and output the unpairable lines coming from file 1.
Output:
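
For comparison, the same composite-key idea can be sketched in Python instead of with sort and join. This is a minimal sketch, not the commands from the steps above. It assumes both files have a header row, that a space follows each comma as in the sample, and that a set of about 200k keys fits in memory. The names composite_key and rows_only_in_first are made up for illustration.

```python
import csv
import sys


def composite_key(row):
    # Column 1 as-is, column 3 lower-cased so the comparison ignores case.
    return (row[0], row[2].lower())


def rows_only_in_first(first, second):
    """Return rows of `first` whose (column 1, column 3) key is absent from `second`.

    Both arguments are open text files (or any iterable of CSV lines).
    Assumption: each input starts with a header row, which is skipped.
    """
    # skipinitialspace=True handles the space after each comma in the sample.
    reader2 = csv.reader(second, skipinitialspace=True)
    next(reader2, None)  # skip the header of file 2
    seen = {composite_key(r) for r in reader2 if len(r) >= 3}

    reader1 = csv.reader(first, skipinitialspace=True)
    next(reader1, None)  # skip the header of file 1
    # Keep file 1's original order; rows with fewer than 3 columns are ignored.
    return [r for r in reader1 if len(r) >= 3 and composite_key(r) not in seen]


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], newline="") as f1, open(sys.argv[2], newline="") as f2:
        missing = rows_only_in_first(f1, f2)
    # QUOTE_ALL keeps every field quoted, but no space is written after the commas.
    csv.writer(sys.stdout, quoting=csv.QUOTE_ALL).writerows(missing)
```

Saved under a hypothetical name such as compare.py, it could be run as `python compare.py file1.csv file2.csv > only_in_file1.csv`. Because the keys of the second file are held in a set, neither file needs to be sorted first, which is the main difference from the join-based approach.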