I have a problem that I am trying to solve with awk. It applies to selecting good-quality single nucleotide polymorphisms (SNPs) for placing on a SNP chip, where no SNP may be within 60 bp of another SNP. The file looks like this:
comp1008_seq1 20
comp1008_seq1 234
comp1008_seq1 260
comp1008_seq1 500
comp3044_seq1 300
comp3044_seq1 350
comp3044_seq1 460
comp3044_seq1 600
…………….
I want to print only records that are not within ±60 of another record (based on field 2) when they are from the same component (based on field 1). It doesn't matter if they are within ±60 when they are from different components. The output for the above example should look like this (260 and 350 are dropped because they fall within 60 of 234 and 300 respectively):
comp1008_seq1 20
comp1008_seq1 234
comp1008_seq1 500
comp3044_seq1 300
comp3044_seq1 460
comp3044_seq1 600
http://ideone.com/h6oEI
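One reading of the rule above, as a minimal sketch rather than a definitive solution: keep the first record of each component, then keep a later record only if its position is more than 60 past the last record that was kept for that component. This assumes the file is grouped by field 1 and sorted ascending by field 2 within each group, as in the sample. The function name filter_snps and the variable min are made up for illustration.

```sh
# Sketch: greedy filter that compares each record with the last KEPT record
# of the same component. Assumes input grouped by field 1 and sorted
# ascending by field 2; filter_snps and min are hypothetical names.
filter_snps() {
    awk -v min=60 '
        # Skip blank or malformed lines.
        NF < 2          { next }
        # New component: always keep its first record.
        $1 != comp      { comp = $1; last = $2; print; next }
        # Same component: keep only if more than min away from the last kept one.
        $2 - last > min { last = $2; print }
    ' "$@"
}
```

It can be called as `filter_snps snps.txt` (snps.txt being a placeholder file name), or, if the file is not already ordered, as `sort -k1,1 -k2,2n snps.txt | filter_snps`. A distance of exactly 60 is treated as "within 60" here, so such a record is dropped; change `>` to `>=` if it should be kept.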