I have a tab-delimited file in which some of the values in columns 10-25 contain the “.” character. I want to filter lines by how often “.” occurs within this column range, so that a line is not printed if “.” is found fewer than 8 times in columns 10-25 (i.e. less than 50% occurrence).
I have looked at similar posts, and the closest I found is one by the user lodge (Match lines with pattern n times in the same line). However, when I tried some of the commands, they did not behave the way I need.
For example, the code below replaced every character with a dot. I am aware this is because it is a global substitution, but it seemed to work for lodge.
awk '{ if (gsub(/./, ".") >= 8) print }' merged.vcf > test.vcf
Here is an example of my file (I only include up to column 11 in this example):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AD0062-C AD0065-C
2L 560 . T C 30.65 PASS AC=3 GT:GQ:PL . .
2L 595 . G T 61.75 PASS AC=11 GT:GQ:PL . 0/1:13:132,0,10
If you want to check whether columns 10-25 are exactly “.”, test each of those fields against a regular expression anchored with ^ and $. If you only care that those columns contain a “.”, omit the ^ and $.
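A minimal sketch of that check, assuming the goal is to keep lines in which at least 8 of columns 10-25 are exactly “.”, and assuming header lines starting with # should pass through unchanged. The function name filter_missing and its threshold argument are hypothetical, not from the original posts. Note that the dot is escaped: an unescaped /./ matches any character, which is why gsub(/./, ".") replaced every character with a dot.

```shell
# filter_missing [MIN]: read tab-delimited lines on stdin and print those with
# at least MIN (default 8) fields in columns 10-25 that are exactly ".".
# Hypothetical helper for illustration.
filter_missing() {
    awk -F'\t' -v min="${1:-8}" '
        /^#/ { print; next }              # assumption: keep header lines as-is
        {
            n = 0
            # stop at NF so lines with fewer than 25 columns are handled
            for (i = 10; i <= 25 && i <= NF; i++)
                if ($i ~ /^\.$/) n++      # anchored: the field is exactly "."
            if (n >= min) print
        }'
}
```

It could be run as: filter_missing 8 < merged.vcf > test.vcf. For the "contains" variant, the field test would become $i ~ /\./ instead.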