I’m trying to extract some DNA info from a file.
The DNA data, made up of the bases G, C, A and T, is preceded by the word ORIGIN and followed by //. How do I write a regular expression that extracts the bases between these markers?
I have tried the following but it doesn’t work.
[ORIGIN(GCATgcat)////]
Sample data:
ORIGIN
1 acagatgaag acagatgaag acagatgaag acagatgaag
2 acagatgaag acagatgaag acagatgaag acagatgaag
//
Try this pattern: “\\b([GCATgcat]+)\\b”. It matches any run of G, C, A and T characters (upper or lowercase) delimited by word boundaries, so it won’t match those characters when they are embedded in other strings, like the word “catalog”. If you repeatedly scan for this regex in your sample file, you will extract each sequence. Here’s a working example for your sample file:
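A minimal sketch of that repeated scan in Java, assuming the whole file has already been read into a string. The class name `DnaExtractor`, the method `extractBases`, and the `RECORD` pattern that first narrows the text to what lies between ORIGIN and // are hypothetical additions for illustration; only the `BASES` pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DnaExtractor {
    // Assumption (not part of the answer): capture everything between a line
    // that starts with ORIGIN and the next line that starts with //.
    private static final Pattern RECORD =
            Pattern.compile("^ORIGIN\\b(.*?)^//", Pattern.MULTILINE | Pattern.DOTALL);

    // The pattern from the answer: a run of base letters between word boundaries.
    private static final Pattern BASES = Pattern.compile("\\b([GCATgcat]+)\\b");

    // Returns each whitespace-separated chunk of bases found inside a record.
    static List<String> extractBases(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher record = RECORD.matcher(text);
        while (record.find()) {
            // Scan repeatedly inside the record body only.
            Matcher bases = BASES.matcher(record.group(1));
            while (bases.find()) {
                chunks.add(bases.group(1));
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "ORIGIN",
                "1 acagatgaag acagatgaag acagatgaag acagatgaag",
                "2 acagatgaag acagatgaag acagatgaag acagatgaag",
                "//");
        List<String> chunks = extractBases(sample);
        System.out.println(chunks.size() + " chunks");
        System.out.println(String.join("", chunks)); // chunks joined into one sequence
    }
}
```

On the sample data this should report eight chunks of `acagatgaag`. The line numbers, the word ORIGIN, and the // terminator are skipped because none of them is a run made up solely of base letters. Restricting the scan to the record body is a design choice, so that short words such as “cat” or “tag” elsewhere in the file are not picked up.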