I have a CSV file with two columns:
cat @ c a t
dog @ d o g
bat @ b a t
To simplify communication, I’ve used English letters for this example, but I’m dealing with CJK in UTF-8.
I would like to delete from the second column any character that appears on fewer than 20 lines in the first column (characters could be anything: numbers, letters, Chinese characters, or punctuation, but not spaces).
For example, if “o” appears on 15 lines in the first column, all appearances of “o” are deleted from the second column. If “a” appears on 35 lines in the first column, no change is made.
- The first column must not be changed.
- I don’t need to count multiple appearances of a letter on a single line. For example, “robot” has 2 o’s, but this detail is not important, only that “robot” has an “o”, so that is counted as one line.
How can I delete the characters that appear on fewer than 20 lines?
Here is a script using awk. Change the var `num` to be your frequency cutoff point. I’ve set it to `1` to show how it works against a small sample file. Note how `f` is still deleted even though it shows up three times on a single line. Also, passing the same input file twice is not a typo.
Sample Input
Output
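A minimal sketch of the two-pass approach described above, and not necessarily the exact script this answer refers to. It makes several assumptions: the columns are separated by ` @ ` as in the question's sample, `filter_rare` is a hypothetical helper name, and a character is deleted when it occurs on `num` or fewer first-column lines (which matches `f` being deleted with `num` set to `1`). For CJK text it also assumes a multibyte-aware awk such as gawk running in a UTF-8 locale, since a byte-oriented awk would split characters apart.

```shell
# filter_rare NUM FILE   (hypothetical helper name, not from the original answer)
# Prints FILE with each second-column character removed when that character
# occurs on NUM or fewer lines of the first column. Spaces are always kept.
filter_rare() {
  awk -v num="$1" '
    BEGIN { FS = " @ " }            # assumed column separator, as in the sample

    # Pass 1 (first read of the file): count lines per first-column character.
    NR == FNR {
      split("", seen)               # reset the set of characters seen on this line
      for (i = 1; i <= length($1); i++) {
        c = substr($1, i, 1)
        if (!(c in seen)) { seen[c] = 1; lines[c]++ }   # one count per line
      }
      next
    }

    # Pass 2 (second read): rebuild column two from the characters to keep.
    {
      out = ""
      for (i = 1; i <= length($2); i++) {
        c = substr($2, i, 1)
        if (c == " " || lines[c] > num) out = out c
      }
      print $1 FS out               # column one is printed unchanged
    }
  ' "$2" "$2"                       # same file twice: once to count, once to filter
}
```

For the question's "fewer than 20 lines" rule under this sketch's convention, something like `filter_rare 19 input.csv > output.csv` would apply, where `input.csv` and `output.csv` are placeholder names. Deleted characters leave their neighbouring spaces in place, so runs of spaces may need tidying afterwards.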