I have a file containing the data shown below. The first comma-delimited field may be repeated any number of times, and I want to print only the lines after the sixth repetition of any value of this field
For example, there are eight fields with 1111111 as the first field, and I want to print only the seventh and eighth of these records
Input file:
1111111,aaaaaaaa,14
1111111,bbbbbbbb,14
1111111,cccccccc,14
1111111,dddddddd,14
1111111,eeeeeeee,14
1111111,ffffffff,14
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,aaaaaaaa,14
2222222,bbbbbbbb,14
2222222,cccccccc,14
2222222,dddddddd,14
2222222,eeeeeeee,14
2222222,ffffffff,14
2222222,gggggggg,14
3333333,aaaaaaaa,14
3333333,bbbbbbbb,14
3333333,cccccccc,14
3333333,dddddddd,14
3333333,eeeeeeee,14
3333333,ffffffff,14
3333333,gggggggg,14
3333333,hhhhhhhh,14
Output:
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14
What I have tried is to transponse the 2nd and 3rd fields with respect to 1st, so that I can use nawk on the field of $7 or $8
#!/usr/bin/ksh awk -F"," '{ a[$1]; b[$1]=b[$1]","$2 c[$1]=c[$1]","$3} END{ for(i in a){ print i","b[i]","c[i]} } ' file > output.txt
If your records are unordered
i.e. you may have "1111111" items distributed randomly throughout your input:
How does this work?
As you know, awk’s
-Foption sets the delimiter. If it’s not a special character, there’s no pressing need to quote it.Awk scripts consist of a series of blocks of
condition { action; }. If the condition is missing, action is applied to every line. If the action is missing, it is implied to beprint;. So an awk script that consists of simply a condition will print every input line for which that condition evaluates to true.In this case, our condition also has elements of an action. That it, it increments elements of an associative array where the keys are your first field. The increment happens regardless of whether the condition evaluates to true. Also, putting
++ahead rather than following the variable causes the increment to happen before the evaluation rather than after it. (I’m talking about the difference between++varandvar++.) And if the resultant incremented array element is greater than 6, the condition evaluates to true, causing the line to print.This is functionally equivalent to the
perlsolutions in other answers, but because of the nature awk scripts is even tighter and (arguably) simpler. And of course, it’s likely to be faster. (In my informal test just now, the awk script above executed more than twice as fast as an equivalent perl script from another answer, processing 250000 lines of input in 0.23s of user time vs 0.61 seconds in perl.)If your records are ordered
i.e. all your "1111111" lines are together:
How does this work?
This has the advantage of not eating memory with an array, but is only appropriate if your goal is to match sequential sets of lines with common $1 rather than handle matching lines that may be randomly distributed throughout your input.