Can GNU sed be used to identify a pattern that spans multiple rows? In other words, how can you include a line break in the pattern you're matching with sed?
For example, in the following dataset (which is much larger in actuality), I have an error that should have been removed when I searched for duplicates, but was not because the information is slightly different in two rows (which is irrelevant at this point).
In this case, I want to remove the error entirely from the original file. In other words, if two rs#### rows follow each other in my file, I would like to erase both copies and also the six lines that follow them. It would be nice to relocate them to a new file, but what is most critical is that they are removed from the original.
rs1038864 16 73762557 A G
1 1633 0.5835 -0.0004 0.0035
1 1643 0.8902 0.004436 0.004354
0 0 0 0 0
rs1019567 16 83343715 G T
rs1019567 16 83343715 G T
1 1641 0.4692 0.0009 0.0035
1 559 0.4612 -0.0025 0.0060
1 1643 0.5178 -0.002244 0.002745
1 1643 0.5178 -0.002244 0.002745
1 1909 0.493842692 0.0008 0.0027
1 1950 0.493842692 0.0008 0.0027
rs1038556 16 55132072 C T
1 6388 0.7773 0.0020 0.0044
1 6843 0.1161 0.001379 0.004275
1 1509 0.978660942 0.0041 0.0096
rs1019797 16 87788686 C G
rs1019797 16 87788686 C G
1 1639 0.717 0.0022 0.0038
1 5557 0.7193 0.0020 0.0064
1 1643 0.6691 -0.001044 0.002888
1 6843 0.6691 -0.001044 0.002888
1 1959 0.315280799 -0.0041 0.0032
1 1909 0.315280799 -0.0041 0.0032
rs1038887 16 62660698 A G
1 1688 0.4947 -0.0028 0.0035
0 0 0 0 0
1 1909 0.464393658 0.0007 0.0028
Something like,
sed -i '/^rs.*d
^rs.*/,+6d' test.data
or perhaps
sed -i '/^rs.*;^rs.*/,+6d' test.data
?
Any thoughts would be appreciated!
If `infile` contains the listed input, something like this should do (GNU sed):
If you want to save the deleted bits to `deleted.txt`, use this:
Note that the `w` command needs to be terminated by a newline.
Explanation
This loads a second line into the pattern space (`N`) and checks whether the two lines are duplicates (`/([^\n]+)\n\1/`). If they are, six more lines are loaded into the pattern space and the whole block is deleted (`d`).
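A minimal sketch of the approach explained above, for GNU sed. The exact expressions are assumptions, not necessarily the commands this answer used. In particular, the match here is anchored and restricted to lines starting with `rs`, so that a data line ending in the same character that begins the next line is not mistaken for a duplicate. `kept.txt` is a hypothetical output name, and the sample is a shortened copy of the question's data.

```shell
# Shortened copy of the question's data; the file name infile follows the answer.
cat > infile <<'EOF'
rs1038864 16 73762557 A G
1 1633 0.5835 -0.0004 0.0035
1 1643 0.8902 0.004436 0.004354
0 0 0 0 0
rs1019567 16 83343715 G T
rs1019567 16 83343715 G T
1 1641 0.4692 0.0009 0.0035
1 559 0.4612 -0.0025 0.0060
1 1643 0.5178 -0.002244 0.002745
1 1643 0.5178 -0.002244 0.002745
1 1909 0.493842692 0.0008 0.0027
1 1950 0.493842692 0.0008 0.0027
rs1038556 16 55132072 C T
1 6388 0.7773 0.0020 0.0044
EOF

# N       append the next input line to the pattern space
# /.../   true when the two lines are identical and start with "rs"
# N x6    pull in the six lines that follow; d then drops all eight
# P; D    otherwise print the first line and restart with the second
sed -E 'N; /^(rs[^\n]*)\n\1$/ { N;N;N;N;N;N; d; }; P; D' infile > kept.txt

# Same filter, but write the eight removed lines to deleted.txt first.
# The w file name runs to the end of the line, so the command has to be
# followed by a newline instead of a semicolon.
sed -E 'N; /^(rs[^\n]*)\n\1$/ { N;N;N;N;N;N; w deleted.txt
d; }; P; D' infile > kept.txt
```

Once the output looks right, the same expression could be run with `sed -i` on the real file to edit it in place, as the question's attempts do. If the two `rs` rows can differ slightly, a looser test such as `/^rs[^\n]*\nrs/` (also an assumption) would match any two consecutive `rs` lines.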