I have a table like this:
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Ptpn14 NM_008976 chr1 191552147 191700574 +
Cd34 NM_001111059 chr1 196765080 196787475 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Epha4 NM_007936 chr1 77363760 77511663 -
Sp110 NM_175397 chr1 87473474 87495392 -
Gbx2 chr1 91824537 91827751 -
Kif1a chr1 94914855 94998430 -
Bcl2 NM_009741 chr1 108434770 108610879 -
And I want to extract data with the following conditions:
1) keep only the lines where the value in the "refseq" column is not missing
2) for the columns "start" and "stop", keep only one value per line: if the value in the "strand" column is "+", take the value in "start"; if it is "-", take the value in "stop".
And this is what I expect:
Susd4 NM_144796 chr1 184695027 +
Ptpn14 NM_008976 chr1 191552147 +
Cd34 NM_001111059 chr1 196765080 +
Gm5698 NM_001166637 chr1 31055753 -
Epha4 NM_007936 chr1 77511663 -
Sp110 NM_175397 chr1 87495392 -
Bcl2 NM_009741 chr1 108610879 -
I would be very tempted to leave the input delimiter unmodified so blanks and tabs separate fields, rather than insisting on tabs only. That means you want records after the first (to skip the headings line) that have six fields:
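A minimal sketch of that idea, assuming the table is saved in a file named `genes.txt` (a hypothetical name; `picked.txt` is likewise only for illustration). With awk's default field splitting, a row whose refseq is blank collapses to five fields, so `NF == 6` filters it out:

```shell
# Recreate the sample table from the question (genes.txt is an assumed file name).
cat > genes.txt <<'EOF'
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Ptpn14 NM_008976 chr1 191552147 191700574 +
Cd34 NM_001111059 chr1 196765080 196787475 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Epha4 NM_007936 chr1 77363760 77511663 -
Sp110 NM_175397 chr1 87473474 87495392 -
Gbx2 chr1 91824537 91827751 -
Kif1a chr1 94914855 94998430 -
Bcl2 NM_009741 chr1 108434770 108610879 -
EOF

# NR > 1    skips the headings line.
# NF == 6   keeps only rows where refseq is present.
# The ternary picks start ($4) on the "+" strand, otherwise stop ($5).
awk 'NR > 1 && NF == 6 { print $1, $2, $3, ($6 == "+" ? $4 : $5), $6 }' genes.txt |
    tee picked.txt
```

Note that this relies on the missing refseq being truly empty; if the file used a placeholder such as `NA` instead, the `NF == 6` test would need to become a test on `$2`.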
If you want to control the output format more, you can dink with OFS, or use printf.
There are other ways to handle it, I’m sure…
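One possible `printf` version, as a sketch: the field widths here are my own guesses chosen to fit the sample data, and `genes.txt` and `formatted.txt` are hypothetical file names.

```shell
# Recreate the sample table from the question (genes.txt is an assumed file name).
cat > genes.txt <<'EOF'
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Ptpn14 NM_008976 chr1 191552147 191700574 +
Cd34 NM_001111059 chr1 196765080 196787475 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Epha4 NM_007936 chr1 77363760 77511663 -
Sp110 NM_175397 chr1 87473474 87495392 -
Gbx2 chr1 91824537 91827751 -
Kif1a chr1 94914855 94998430 -
Bcl2 NM_009741 chr1 108434770 108610879 -
EOF

# Same filter as before (skip the header, require six fields), but printf
# aligns the columns: names left-justified, the coordinate right-justified.
awk 'NR > 1 && NF == 6 {
    printf "%-8s %-14s %-6s %10s %s\n", $1, $2, $3, ($6 == "+" ? $4 : $5), $6
}' genes.txt | tee formatted.txt
```

If tab-separated output is all you need, keeping plain `print` and setting `OFS` (for example `awk -v OFS='\t' '...'`) is the lighter option.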
The first script produces:
The content is correct, I believe; the formatting can be improved in many ways. The last script produces:
You can tweak field widths as necessary.