Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2:
rs001 1 1 1 2
rs002 2 2 2 2
Basically, I want to replace every pair of identical nucleotides (AA, CC, TT, or GG) with 2 and every pair of different nucleotides (AT, TA, CG, etc.) with 1. The input should first be converted to out1 and then to out2. Each row has many fields (around 200 columns), so a loop is needed.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
The results are weird. Can anyone tell me why I can’t get out1? What mistake did I make in the awk loop?
Thanks in advance
Here’s how you fix your awk script to get output 1: print adds a newline at the end by default, so you’ll have to use formatted strings with printf to specify exactly where you want the newlines. (I also added printf "%s ", $1; at the start to print the header at the start of each line.)
Edit: Triplee’s solution looks much more elegant than mine; you should ditch awk and go with his =)
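For reference, a minimal sketch of the printf approach described above, plus one possible way to get from out1 to out2. It is not necessarily the exact script from this answer. The function names to_pairs and to_codes are hypothetical, and the sketch assumes whitespace-separated fields where every field after the ID is a single letter.

```shell
# to_pairs (hypothetical name): input -> out1.
# printf does not append a newline, so one is emitted only after the loop.
# The condition x < NF skips an unpaired trailing field, if there is one.
to_pairs() {
    awk '{
        printf "%s", $1
        for (x = 2; x < NF; x += 2)
            printf " %s%s", $x, $(x + 1)
        printf "\n"
    }'
}

# to_codes (hypothetical name): out1 -> out2.
# Assumes each field after the ID is exactly two letters:
# 2 when both letters match, 1 otherwise.
to_codes() {
    awk '{
        printf "%s", $1
        for (x = 2; x <= NF; x++)
            printf " %d", ((substr($x, 1, 1) == substr($x, 2, 1)) ? 2 : 1)
        printf "\n"
    }'
}

# Sample rows from the question, piped through both steps.
printf '%s\n' 'rs001 A C T G C G T T' 'rs002 C C T T G G A A' | to_pairs | to_codes
# prints:
# rs001 1 1 1 2
# rs002 2 2 2 2
```

With a real file, to_pairs < input > out1 followed by to_codes < out1 > out2 keeps the intermediate result, which matches the two-step order asked for in the question.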