I’m trying to split up a nucleotide sequence into amino acid strings using a regular expression. I have to start a new string at each occurrence of the string “ATG”, but I don’t want to actually stop the first match at the “ATG”. Valid input is any ordering of a string of As, Cs, Gs, and Ts.
For example, given the input string: ATGAACATAGGACATGAGGAGTCA
I should get two strings: ATGAACATAGGACATGAGGAGTCA (the whole thing, since it starts with “ATG”) and ATGAGGAGTCA (from the second occurrence of “ATG” onward). A string that contains “ATG” n times should yield n results.
I thought the expression /(?:[ACGT]*)(ATG)[ACGT]*/g would work, but it doesn’t. If this can’t be done with a regexp, it’s easy enough to just write the code out by hand, but I always prefer an elegant solution if one is available.
If you really want to use regular expressions, try this:
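A minimal sketch of one way this could look, assuming the input contains only A, C, G, and T. The function name `atgSuffixesRegex` is made up for illustration. The pattern consumes everything from an “ATG” to the end of the string, so `lastIndex` has to be rewound after each match to let the search find the next “ATG” inside what was just matched:

```javascript
// Hypothetical helper: collect every substring that runs from an "ATG"
// to the end of the sequence, using a global regex and exec().
function atgSuffixesRegex(seq) {
  const re = /ATG[ACGT]*/g;
  const results = [];
  let match;
  while ((match = re.exec(seq)) !== null) {
    results.push(match[0]);
    // The match swallowed the rest of the string, so rewind the search
    // position to just past the start of this match. It must move strictly
    // forward (index + 1 or more); resetting it to match.index would find
    // the same "ATG" again and loop forever.
    re.lastIndex = match.index + 1;
  }
  return results;
}

console.log(atgSuffixesRegex("ATGAACATAGGACATGAGGAGTCA"));
// → [ 'ATGAACATAGGACATGAGGAGTCA', 'ATGAGGAGTCA' ]
```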
But be careful with `exec` and changing the index: you can easily create an infinite loop.

Otherwise, you could use `indexOf` to find the indices and `substr` to get the substrings:
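A rough sketch of the non-regex approach, with a made-up function name (`atgSuffixesIndexOf`). It uses `slice` rather than `substr`; with a single non-negative start index the two return the same substring:

```javascript
// Hypothetical helper: find each "ATG" with indexOf and take the
// remainder of the sequence from that position.
function atgSuffixesIndexOf(seq) {
  const results = [];
  let idx = seq.indexOf("ATG");
  while (idx !== -1) {
    results.push(seq.slice(idx));
    // Resume the search one character past the current hit so the
    // same "ATG" is not found again.
    idx = seq.indexOf("ATG", idx + 1);
  }
  return results;
}

console.log(atgSuffixesIndexOf("ATGAACATAGGACATGAGGAGTCA"));
// → [ 'ATGAACATAGGACATGAGGAGTCA', 'ATGAGGAGTCA' ]
```

Because the loop always advances `idx`, it terminates after at most one pass over the string and returns one entry per occurrence of “ATG”.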