I have a list of possible substrings, e.g. ['cat', 'fish', 'dog']. In practice, the list contains hundreds of entries.
I’m processing a string, and I need to find the index of the first appearance of any of these substrings.
To clarify, for '012cat' the result is 3, and for '0123dog789cat' the result is 4.
I also need to know which substring was found (e.g. its index in the substring list or the text itself), or at least the length of the substring matched.
There are obvious brute-force ways to achieve this, but I wondered if there’s an elegant Python/regex solution.
I would assume a regex is better than checking for each substring individually because conceptually the regular expression is modeled as a DFA, and so as the input is consumed all matches are being tested for at the same time (resulting in one scan of the input string).
So, here is an example:
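A minimal sketch of the idea, assuming the word list from the question and a hypothetical helper named `first_match` (neither the name nor the exact shape comes from the original listing):

```python
import re

# Word list taken from the question; in practice it may hold hundreds of entries.
words = ['cat', 'fish', 'dog']

# Compile once; the alternation tries each word at every position of the input.
# Note: no escaping is done here, so this assumes the words contain no regex
# special characters.
pattern = re.compile('|'.join(words))

def first_match(text):
    """Return (index, matched_text) for the earliest match, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.start(), m.group()

print(first_match('012cat'))
print(first_match('0123dog789cat'))
```

The match object gives both the position (`m.start()`) and the matched text (`m.group()`), so the length or the index in the word list can be derived from it.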
UPDATE:
Some care should be taken when combining words into a single pattern of alternatives. The following code builds a regex, but escapes any regex special characters and sorts the words so that longer words get a chance to match before any shorter prefixes of the same word:
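One way this could look, as a sketch only; `build_pattern` and the sample words are hypothetical names chosen for illustration:

```python
import re

def build_pattern(words):
    """Build one alternation pattern from literal words."""
    # Longest first, so 'catfish' is tried before its prefix 'cat'.
    ordered = sorted(set(words), key=len, reverse=True)
    # re.escape makes characters such as '.' match literally.
    return re.compile('|'.join(re.escape(w) for w in ordered))

pattern = build_pattern(['cat', 'catfish', 'a.b', 'dog'])

m = pattern.search('my catfish')
print(m.start(), m.group())
```

Without the escaping step, a word like `a.b` would also match `axb`; without the sort, `cat` could win over `catfish` at the same position.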
END UPDATE
It should be noted that you will want to build the regex (i.e., call re.compile()) as rarely as possible. The best case is that you know your search terms ahead of time (or compute them once or infrequently) and then save the result of re.compile() somewhere. My example is just a simple nonsense function so you can see how the regex is used. There are more regex docs here:
http://docs.python.org/library/re.html
Hope this helps.
UPDATE: I am unsure how Python implements regular expressions, but to answer Rax’s question about whether re.compile() has limitations (for example, how many words you can "|" together to match at once) and how long compilation takes: neither seems to be an issue. I tried out this code, which is good enough to convince me. (I could have improved it by adding timing and reporting results, and by putting the words into a set to ensure there are no duplicates, but both improvements seem like overkill.) The code ran basically instantaneously and convinced me that I can search for 2000 words (of length 10) and that any of them will match appropriately. Here is the code:
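A sketch of such a test, assuming random lowercase words and a hypothetical helper `random_word`; it is not the original listing and does not measure timing:

```python
import random
import re
import string

def random_word(length=10):
    """Return a random lowercase word of the given length."""
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

# 2000 words of length 10, joined into one large alternation.
words = [random_word() for _ in range(2000)]
pattern = re.compile('|'.join(words))

# Embed each word between digits and check that the pattern finds it there.
all_found = all(
    pattern.search('0123' + w + '4567').group() == w
    for w in words
)
print(all_found)
```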
UPDATE: It should be noted that the order of the alternatives ORed together in the regex matters. Have a look at the following test inspired by TZOTZIOY:
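The original test is not reproduced here; a minimal sketch of the same effect, using illustrative words that share a prefix, could be:

```python
import re

text = 'catfish'

# Same two alternatives, in opposite orders.
short_first = re.compile('cat|catfish').search(text).group()
long_first = re.compile('catfish|cat').search(text).group()

print(short_first)
print(long_first)
```

At a given position, Python's `re` takes the first alternative that succeeds rather than the longest one, which is why the two patterns return different matches.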
This suggests the order matters :-/. I am not sure what this means for Rax’s application, but at least the behavior is known.
UPDATE: I posted a question about the implementation of regular expressions in Python, which will hopefully give us some insight into the issues found with this question.