The input is like this: CNNCC NCNCN NNNCC CCNNN CCCCN The output should be


Asked: June 13, 2026 (2026-06-13T18:01:41+00:00)


The input is like this:

CNNCC
NCNCN
NNNCC
CCNNN
CCCCN

The output should be like this:

CNNCC
CCCCN

which means that if a line contains 3 or more occurrences of N, it is filtered out; otherwise it is kept. (In my work, I need to filter 100000 lines, dropping those with more than 500 N, so performance might be important.)

I know how to filter by consecutive N in awk, but I don't know how to count non-consecutive ones.

Does anyone have ideas about this? Solutions in shell are also OK.

Among all the answers, I think this one might be the simplest:

awk -FN 'NF<=3'
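To check the one-liner against the sample input from the question, a small sketch like the following can be used. The function name filter_n is only a hypothetical wrapper for illustration; the awk command itself is the one quoted above.

```shell
# Hypothetical wrapper around the one-liner from the question.
filter_n() {
    # N is the field separator, so NF equals (number of N) + 1.
    # NF<=3 therefore keeps lines with at most 2 occurrences of N.
    awk -FN 'NF<=3'
}

# Feed the five sample lines through the filter.
result=$(printf '%s\n' CNNCC NCNCN NNNCC CCNNN CCCCN | filter_n)
printf '%s\n' "$result"
```

On the sample input this prints CNNCC and CCCCN, matching the expected output.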


1 Answer

Editorial Team · Answer 1 · 2026-06-13T18:01:42+00:00

awk -FN -vcount=3 'NF<=count'

or, for older awk which does not support the -v option,

awk -FN 'NF<=count' count=3

The command uses the target character as the field separator, so the number of fields (NF) is the number of occurrences plus one. Comparing NF against count therefore prints only the lines that contain fewer than count occurrences of that character.
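The relation between NF and the number of occurrences can be made visible with a short sketch. The helper name count_n is hypothetical and used only for illustration.

```shell
# Hypothetical helper: print each line followed by its number of N characters.
count_n() {
    # With N as the separator, a line with k occurrences splits into k + 1 fields.
    awk -FN '{ print $0, NF - 1 }'
}

counts=$(printf '%s\n' CNNCC NCNCN CCCCN | count_n)
printf '%s\n' "$counts"
```

This prints "CNNCC 2", "NCNCN 3" and "CCCCN 1". By the same reasoning, keeping lines with at most 500 occurrences of N would need count=501.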

The intent of the command is not immediately obvious, which makes it less readable. It does, however, have the advantage that the character and the count are parametrised, so it can easily be reused for different settings.

Admittedly, this is not very efficient for large values of count. Limiting the maximum number of fields to count+1 would overcome this performance issue; unfortunately, the -mf option is ignored by gawk.
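If field splitting turns out to be a bottleneck, one alternative worth benchmarking is to count matches with gsub instead. This is a different technique from the answer above, shown only as a sketch: the function name keep_fewer_than is hypothetical, it assumes an awk that supports -v, and it assumes the character has no special meaning in a regular expression. Whether it is actually faster depends on the awk implementation and the data, so it should be measured.

```shell
# Hypothetical helper: keep lines with fewer than $2 occurrences of character $1.
# Assumes $1 is a single character that is not a regex metacharacter.
keep_fewer_than() {
    # gsub returns the number of replacements; "&" puts back the matched
    # text, so the content of the line is left unchanged.
    awk -v ch="$1" -v count="$2" '{ n = gsub(ch, "&") } n < count'
}

kept=$(printf '%s\n' CNNCC NCNCN NNNCC CCNNN CCCCN | keep_fewer_than N 3)
printf '%s\n' "$kept"
```

On the sample input this keeps the same two lines as the -FN version, CNNCC and CCCCN.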
