I am searching for motifs in sequences (which have gaps, that’s why the regular expression looks overcomplex with a -* after each symbol in pattern).
The following code doesn’t stop executing
import re
line = """MRVKE---TRKNY-QH--------S-----W-------GRGLWSLWRW-------------G---T-------MLLG--ML-M----IS-S--A-A-----E-Q---S--WVTVYYGVPVWREATT-TLFCASDAKAYDTEKH-NVWATHACVPTDPNPQEVQL--NVTENFNMWKNNMVDQMHEDIISLWDQSLKPCVQLTPLCVT-LNC-SD------TINA---TTANNTINA----------------TTT-----TPT-----I----NATT-------------ANKSMEIG---------E---MR----NCSFNIT----NM---G-K-KMK--EYALFYN----LDVV---------------SI-----------------D-------E-----------------DNNNK-------------------------------------------TS--------Y---RLK-SCNTSVI-TQACP-KVSFKPIPIHYCAPAGFAILKCND-KKFNGTGPCGNVSTVQCTHGIKPVVSTQLLLNGSLAE-E-EVVIRSENFTNNVKTIIVQLKNPVMINCTRP-NNNTR-KS-I---HM---GP----GQ-A-F-YAT-GAI---IGDIR-QAHCNI--SE-------------------------------------------K--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------E"""
pattern = "[KR]?-*[KR]?-*[KR]-*[^-]?-*[^-]?-*[KR]-*[^-]-*[^-]-*[^-]?-*[^-]?-*[ILVM]-*[^-]-*[ILVF]"
o = re.search(pattern, line)
The code works (that is finishes execution in microseconds) for other lines and for other motifs, e.g. the following ones:
pattern = "[KR]?-*[KR]?-*[KR]-*[^-]?-*[^-]?-*[KR]-*[^-]-*[^-]-*[^-]?-*[^-]?-*[ILVM]-*[^-]-*"
pattern = "[KR]-*[^-]?-*[^-]?-*[KR]-*[^-]-*[^-]-*[^-]?-*[^-]?-*[ILVM]-*[^-]-*[ILVF]"
If the large gap is removed from the end of the line it also works fine.
As a matter of fact vim also fails to finish execution for the particular this regex search.
It looks like this is one of those REs that take super-linear time in naive RE matchers, such as Python’s. You can speed the first pattern up dramatically by rewriting it as something like
where
(?:introduces a non-capturing group.EDIT: the above RE is not entirely equivalent to yours; please correct it. The spirit of it is: use the
{m,n}operation for repetition, since it causes less backtracking.