Given a long text file like this one (that we will call file.txt):
EDITED
1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA
How can I delete lines that appear more than once in the same file in bash, keeping only the first occurrence of each? What I mean is that I want to have this result:
1 AA
2 ab
3 azd
6 aslmdkfj
I do not want any line to appear twice in a given text file. Could you show me the command, please?
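Assuming whitespace is significant, the typical solution is an awk one-liner along the lines of awk '!x[$0]++' file.txt, which prints each line only the first time it is seen.
(For example, the line “ab ” is not considered the same as “ab”. It is probably simplest to pre-process the data if you want to treat whitespace differently.)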
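As a minimal sketch of the whole-line case, assuming the usual awk idiom of an array keyed on $0, the run below uses a hypothetical temporary file in place of file.txt, holding the sample lines without their leading numbers:

```shell
# Hypothetical sample data standing in for file.txt (an assumption for this demo).
tmp=$(mktemp)
printf '%s\n' AA ab azd ab AA aslmdkfj AA > "$tmp"

# x[$0] counts how many times each whole line has been seen so far.
# !x[$0]++ is true only when the count was still 0, i.e. on the first
# occurrence, so awk's default action (print) fires exactly once per line.
result=$(awk '!x[$0]++' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

The order of the first occurrences is preserved and no sorting is needed, which is the main reason to prefer this over sort -u when the original order matters.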
–EDIT–
Given the modified question, which I’ll interpret as only wanting to check uniqueness after a given column, try something like awk '!x[substr($0,2)]++' file.txt
This will only compare character columns 2 through the end of the line, ignoring the first character. This is a typical awk idiom: we are simply building an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) which holds the number of times a given string has been seen. The first time a string is seen, the line is printed. In the first case, we are using the entire input line, contained in $0. In the second case, we are only using the substring that starts at the 2nd character and runs to the end of the line.
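The second case can be sketched the same way. This assumes the key is substr($0,2), as the description of "everything from the 2nd character" suggests, and uses a hypothetical temporary file containing the numbered sample from the question:

```shell
# Hypothetical copy of the numbered sample from the question.
tmp=$(mktemp)
printf '%s\n' '1 AA' '2 ab' '3 azd' '4 ab' '5 AA' '6 aslmdkfj' '7 AA' > "$tmp"

# substr($0, 2) drops the first character of the line, so "1 AA" and
# "5 AA" both map to the key " AA" and only the first one is printed.
result=$(awk '!x[substr($0, 2)]++' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Note that this only skips a single leading character, so it relies on the prefix being one character wide. If the line numbers can have more than one digit, keying on a field instead (for example $2, assuming the text after the number is a single word) may be a better fit.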