I have a table like this: symbol refseq seqname start stop strand Susd4 NM_144796

Asked: June 17, 2026 (2026-06-17T06:03:34+00:00)

I have a table like this:

symbol  refseq          seqname start           stop            strand
Susd4   NM_144796       chr1    184695027       184826500       +
Ptpn14  NM_008976       chr1    191552147       191700574       +
Cd34    NM_001111059    chr1    196765080       196787475       +
Gm5698  NM_001166637    chr1    31034088        31055753        -
Epha4   NM_007936       chr1    77363760        77511663        -
Sp110   NM_175397       chr1    87473474        87495392        -
Gbx2                    chr1    91824537        91827751        -
Kif1a                   chr1    94914855        94998430        -
Bcl2    NM_009741       chr1    108434770       108610879       -

And I want to extract data with the following conditions:

1) keep only the lines where the value in the “refseq” column is not missing

2) for the columns “start” and “stop”, keep only one value per line: if the value in the column “strand” is “+”, take the value in “start”; if the value in the column “strand” is “-”, take the value in “stop”.

And this is what I expect:

Susd4   NM_144796       chr1    184695027       +
Ptpn14  NM_008976       chr1    191552147       +
Cd34    NM_001111059    chr1    196765080       +
Gm5698  NM_001166637    chr1    31055753        -
Epha4   NM_007936       chr1    77511663        -
Sp110   NM_175397       chr1    87495392        -
Bcl2    NM_009741       chr1    108610879       -

1 Answer

Editorial Team · Answer 1 · 2026-06-17T06:03:36+00:00

I would be tempted to leave awk's default field separator alone, so that any run of blanks or tabs separates fields, rather than insisting on tabs only. A line with a missing refseq then has only five fields, so you want the records after the first (to skip the heading line) that have exactly six fields:

awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'

If you want more control over the output format, you can set OFS, or use printf:

awk 'BEGIN { OFS = "\t" }
     NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'

awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5;
                         printf "%-8s %-12s %s %9s\n", $1, $2, $3, x; }'

There are other ways to handle it, I’m sure…
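
One such alternative, sketched here under the assumption that the file really is tab-separated with exactly one tab between fields (so a missing refseq shows up as an empty second field), is to set the field separator to a tab and test $2 directly. The sample data is built with printf purely for illustration, and this variant also prints the strand column:

```shell
# Build a small tab-separated sample; printf makes the tab characters explicit.
# (Illustrative data only; printf reuses the format for each group of six values.)
input=$(printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    symbol refseq seqname start stop strand \
    Susd4 NM_144796 chr1 184695027 184826500 + \
    Gbx2 '' chr1 91824537 91827751 - \
    Bcl2 NM_009741 chr1 108434770 108610879 -)

# With FS set to a tab, an empty refseq is an empty $2 rather than a missing field,
# so the test is on $2 instead of on NF.
result=$(printf '%s\n' "$input" |
    awk -F'\t' 'BEGIN { OFS = "\t" }
        NR > 1 && $2 != "" { print $1, $2, $3, ($6 == "+" ? $4 : $5), $6 }')
printf '%s\n' "$result"
```

If the real file mixes blanks and tabs, this version would misread it, which is why the NF == 6 test above is the more forgiving choice.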

The first script produces:

Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879

The content is correct, I believe; the formatting can be improved in many ways. The last script produces:

Susd4    NM_144796    chr1 184695027
Ptpn14   NM_008976    chr1 191552147
Cd34     NM_001111059 chr1 196765080
Gm5698   NM_001166637 chr1  31055753
Epha4    NM_007936    chr1  77511663
Sp110    NM_175397    chr1  87495392
Bcl2     NM_009741    chr1 108610879

You can tweak field widths as necessary.
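
The expected output in the question also carries the strand column, which the scripts above do not print. A minimal sketch of that variant, assuming whitespace-separated input as before and feeding a shortened sample through a here-document for illustration:

```shell
# Same filter as the scripts above, with the strand ($6) appended to each line.
# The here-document holds a shortened copy of the table from the question.
result=$(awk 'BEGIN { OFS = "\t" }
     NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x, $6 }' <<'EOF'
symbol  refseq          seqname start           stop            strand
Susd4   NM_144796       chr1    184695027       184826500       +
Gm5698  NM_001166637    chr1    31034088        31055753        -
Gbx2                    chr1    91824537        91827751        -
EOF
)
printf '%s\n' "$result"
```

The Gbx2 line has only five whitespace-separated fields, so the NF == 6 test drops it without any extra check on the refseq column.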
