The input is like this:
CNNCC
NCNCN
NNNCC
CCNNN
CCCCN
The output should be like this:
CNNCC
CCCCN
which means that if there are 3 or more occurrences of N, the line is filtered out; otherwise it is kept. (In my work, I need to filter out 100000 lines with more than 500 N, so performance might be important.)
I know how to filter by consecutive N in awk, but I don't know how to count non-consecutive ones.
Does anyone have ideas about this? Solutions in shell are also OK.
Among all the answers, I think this one might be the simplest:
awk -FN 'NF<=3'
or, for an older awk which does not support the -v option,
The command uses the target char as the field separator and the maximum allowed occurrence as count. By comparing the resulting number of fields against count, we can selectively print lines that meet our criteria.
The intention of the statement is not immediately obvious, which makes it less readable. It does, however, have the advantage of having the char and count parametrised, so it can easily be reused for different settings.
Admittedly, this would not be very efficient for large values of count. Setting the maximum number of fields to count+1 would overcome this performance issue; unfortunately, the -mf option is ignored by gawk.
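For illustration, here is a minimal sketch of what the parametrised form described above might look like. It is not the original command from the answer. The function names, and the choice to compare NF against count + 1, are assumptions: a line containing k separators splits into k + 1 fields, so "at most count occurrences" becomes NF <= count + 1. It also assumes char is a single ordinary character such as N.

```shell
# Hypothetical helper: keep lines with at most "count" occurrences of "char".
# A line with k separators has k+1 fields, hence the test NF <= count + 1.
filter_max_occurrences() {
    char=$1
    count=$2
    awk -F"$char" -v count="$count" 'NF <= count + 1'
}

# Same idea without -v: the assignment is passed as an operand, which awk
# applies before reading the following input ("-" means standard input).
filter_max_occurrences_no_v() {
    char=$1
    count=$2
    awk -F"$char" 'NF <= count + 1' count="$count" -
}

# Usage with the sample input from the question, allowing at most 2 N per line.
printf '%s\n' CNNCC NCNCN NNNCC CCNNN CCCCN | filter_max_occurrences N 2
```

With count set to 2, this should keep CNNCC and CCCCN, which matches the expected output in the question and is equivalent to the fixed form awk -FN 'NF<=3'.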