I am using Perl and a regular expression to find an ORF (open reading frame) with a minimum size of 45 bases.
Basically it means:
Find a substring of a string that is composed ONLY of the letters ATGC (no spaces or new lines) that:
- Starts with “ATG”
- Ends with “TAG”, “TAA”, or “TGA”
- Is at least 39 chars long
- Has a length divisible by 3
My first code was:
    $CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATGA";
    if ($CDSString =~ m/(ATG.{45,}(TAG|TAA|TGA))/)
    {
        my $CDSCurrent = $1;
        if ((length($CDSCurrent) % 3) == 0)
        {
            # do something
        }
    }
which works fine, but I thought there might be a better way.
So I tried:
    $CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATGA";
    if ($CDSString =~ m/ATG(...){13,}(TAG|TAA|TGA)/)
    {
        # do something
    }
but for some reason it doesn’t match the string above it, and I can’t figure out why.
Can anyone figure it out? Thank you in advance.
Your regex does not make sure that everything between the start and stop codons is in fact composed of the letters ATGC only, because . matches any character. You should restrict the codons to a character class such as [ATGC] instead.
(But your original regex works, too; it just won’t reject invalid matches. So there may be another problem somewhere else.)
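For illustration, here is a minimal sketch of what such a pattern could look like. The helper name find_orf and the sample strings are my own assumptions, not taken from the question, and the pattern is just one way to write it: each codon is three characters from [ATGC], at least 13 codons must sit between the start and stop codons, and the whole match is captured.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the question):
# returns the first ORF-like substring found, or undef if there is none.
sub find_orf {
    my ($seq) = @_;
    # ATG                 start codon
    # (?:[ATGC]{3}){13,}  at least 13 codons (39 bases), letters A/T/G/C only
    # (?:TAG|TAA|TGA)     stop codon, in frame with the start codon
    return $seq =~ /(ATG(?:[ATGC]{3}){13,}(?:TAG|TAA|TGA))/ ? $1 : undef;
}

# Sample inputs built with the repetition operator so their lengths are known.
my %samples = (
    in_frame     => "ATG" . ("CAC" x 13) . "TGA",   # 45 bases, in frame
    too_short    => "ATG" . ("CAC" x 12) . "TGA",   # only 12 codons in between
    out_of_frame => "ATG" . ("CA"  x 20) . "TGA",   # 40 bases in between
    bad_letters  => "ATG" . ("CAC" x 6) . "NNN" . ("CAC" x 6) . "TGA",
);

for my $name (sort keys %samples) {
    my $orf = find_orf($samples{$name});
    printf "%-12s %s\n", $name, defined $orf ? "match (" . length($orf) . " bases)" : "no match";
}
```

Because the pattern consumes the sequence three characters at a time, a match is only possible when the stop codon is in frame with the ATG, so a separate length % 3 check is not needed. Note that the quantifier is greedy, so the match extends to the last in-frame stop codon it can reach rather than the first one.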