I have a CSV file with two columns: a numeric ID (IDVAR) and an associated value (VAL). VAL contains non-alphabetic garbage characters that need cleaning up. The structure looks like this:
IDVAR VAL
001 abc - 1
002 zfas $^6
003 asdf_78
004 hg :65
I want to throw out the "-", "_", "1", "$", "^" etc. from the 2nd variable only, i.e. remove a specified set of characters from VAL, without touching IDVAR.
Post-Solution Edit: Many thanks to SiegeX for such an elegant solution. Please note that my file is indeed comma-separated, so I just have to add the -F, option to his awk command.
This will work for you:
Example
Explanation
NR>1: Apply the block only to lines after the header row containing IDVAR VAL, so the header is printed unchanged.
t=$1: Save the first field (IDVAR) in the temporary variable t.
gsub(/[^[:alpha:]]/,""): Replace every non-alphabetic character with the empty string. Note that gsub() without a third argument applies to the entire line, which is why IDVAR was saved in t above.
$0=t "\t" $0: Prepend t to the beginning of the line, separated by a tab.
1: Awk shortcut for print $0. The pattern 1 is always true, and the default action for a true pattern with no explicit action is to print the current line.
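Putting the pieces explained above together, the command can be sketched roughly as follows. This is a reconstruction assembled from the explanation, not necessarily the exact command as originally posted. The file name sample.txt is only an assumption for illustration, and the sample data is whitespace-separated as displayed in the question.

```shell
# Recreate the sample data from the question (hypothetical file name).
cat > sample.txt <<'EOF'
IDVAR VAL
001 abc - 1
002 zfas $^6
003 asdf_78
004 hg :65
EOF

# NR>1         : leave the header row untouched
# t=$1         : remember IDVAR before the line is modified
# gsub(...)    : delete every non-alphabetic character from the whole line
# $0=t "\t" $0 : put IDVAR back in front, separated by a tab
# 1            : always-true pattern, prints the (possibly modified) line
awk 'NR>1 { t=$1; gsub(/[^[:alpha:]]/,""); $0=t "\t" $0 } 1' sample.txt
```

For a comma-separated file, as noted in the question's edit, adding -F, right after awk makes $1 pick up only the ID. The output would still be tab-separated; replacing "\t" with "," in the command should keep it comma-separated. The [[:alpha:]] class needs an awk with POSIX character class support, which very old mawk builds may lack.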