First post here, and it’s an awk question.
I have a file that looks like this:
Motif name class from to strand sequence score
>ENSBTAG00000000436
MA0079.2 SP1 Zinc-coordinating 29 38 - agggggtggg 6.33
... (50 similar lines)
>ENSBTAG00000000380
MA0113.1 NR3C1 Zinc-coordinating 92 109 - ccagaaagtgcttctccc 7.03
... (57 similar lines)
And so on. Note that the >ENSBTA line is a ‘label’ for a set of records. What I’d like is for the >ENSBTA line to be appended as a field to each line beginning with MA…, i.e.:
MA0079.2 SP1 Zinc-coordinating 29 38 - agggggtggg 6.33 >ENSBTAG00000000436
So far I have
awk '{if (NR>1&&NF==1) genename=$1; if (NR>1&&NF>1) print $0, genename}'
This is quite close, but it doesn’t keep the ENSBTAG identifiers with the right lines. Referring to the example above, not all 57 lines of the second part of the file get the correct identifier (ENSBTAG00000000380).
Could someone please suggest the best way to go about this?
Thanks
Iain
Not tested, but something like this ought to be close to what you want:
In essence, that says: if a line matches “>ENSBTAG” at the beginning, save that line in a variable; if it matches “MA” at the beginning, print the line with the most recent tag appended.
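Spelled out as a command, that logic might look like the sketch below. It has not been run against the real data; the variable name `tag` and the temporary sample file are placeholders, and the sample lines are the ones quoted in the question.

```shell
# Build a small sample from the lines quoted in the question
# (the temporary file stands in for the real input file).
input=$(mktemp)
cat > "$input" <<'EOF'
Motif name class from to strand sequence score
>ENSBTAG00000000436
MA0079.2 SP1 Zinc-coordinating 29 38 - agggggtggg 6.33
>ENSBTAG00000000380
MA0113.1 NR3C1 Zinc-coordinating 92 109 - ccagaaagtgcttctccc 7.03
EOF

# Rule 1: a line starting with ">ENSBTAG" is a label, so remember it.
# Rule 2: a line starting with "MA" is a record, so print it with the
#         most recently seen label appended as an extra field.
out=$(awk '/^>ENSBTAG/ { tag = $0 } /^MA/ { print $0, tag }' "$input")
printf '%s\n' "$out"

rm -f "$input"
```

The header line matches neither pattern, so it is dropped without needing an `NR>1` test.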
If your record lines don’t always match “^MA” (all the ones in the example do, but I don’t want to assume that), or if the tags sometimes look a bit different, you’ll need to modify the regexps accordingly.
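If the patterns do need loosening, one possible variant (an assumption, not something from the thread) treats any line starting with “>” as a label and every other non-empty line after the header as a record. The `XX0001.1` line in the sample is invented purely to show a record that does not start with “MA”.

```shell
# Sample input; the temporary file stands in for the real input file.
input=$(mktemp)
cat > "$input" <<'EOF'
Motif name class from to strand sequence score
>ENSBTAG00000000436
MA0079.2 SP1 Zinc-coordinating 29 38 - agggggtggg 6.33
>ENSBTAG00000000380
XX0001.1 FOO Hypothetical 1 5 + acgta 1.00
EOF

# NR == 1 : skip the column-header line (assumes it is always line 1).
# /^>/    : any line starting with ">" is a label; store it and move on.
# NF      : any remaining non-empty line is a record; print it with the label.
loose=$(awk 'NR == 1 { next } /^>/ { tag = $0; next } NF { print $0, tag }' "$input")
printf '%s\n' "$loose"

rm -f "$input"
```

The `next` after storing the label keeps label lines from also being printed as records.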