How can I use regex in python to capture something between two strings or phrases, and removing everything else on the line?
For example, the following is a protein sequence preceded by a one-line header. How can I sift off “CG33289-PC” from the header below based on the stipulation that is occurs after the phrase “FlyBase_Annotation_IDs:” and before the next comma “,” ?
I need to substitute the header with this simplified result “CG33289-PC” and not destroy the protein sequence (found below the header-line in all caps).
This is what each protein sequence entry looks like – a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
Using regexps: