I’m trying to find certain keywords in a string with Python. The string looks something like this:
A was changed from B to C
All I’m trying to find is the “to C” part, where C is one of many thousands of words.
This code builds the regexp string:
pre_pad = 'to '
regex_string = None
for i in words:
    if regex_string == None:
        regex_string = '\\b%s%s(?!-)(?!_)\\b' %(pre_pad, i)
    else:
        regex_string = regex_string + '|\\b%s%s(?!-)(?!_)\\b' %(pre_pad, i)
And later on I do:
matches = []
for match in re.finditer(r"%s" %regex_string, text):
    matches.append([match, MATCH_TYPE])
This code works on Linux but crashes on macOS with “Caught OverflowError while rendering: regular expression code size limit exceeded”.
I realize that the regex_string is very long and that this is the cause of the problem:
print regex_string.__len__()
63574
How can I fix this so that it always works, regardless of the number of words?
EDIT:
I forgot to mention that the pre_pad is sometimes empty: pre_pad = '', so searching for pre_pad first is not always possible.
In addition, the reason I build the entire regex_string first and then match it against the text is that I have to do this matching for many thousands of entries. If I had to rebuild the regex_string every single time, performance would be very poor.
Oh, and I need to know which word matches.
This is not the kind of task you can solve with one huge regexp and expect better performance than this:
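A minimal sketch of one non-regexp alternative, assuming every keyword is a single run of word characters (no spaces or punctuation): keep the words in a set, scan the text once with a small generic pattern, and look each candidate up. The names find_keywords and token are made up for illustration, and the tuple layout of the result is just one possible choice.

```python
import re

def find_keywords(text, words, pre_pad="to "):
    """Return (word, start, end) for every keyword found after pre_pad."""
    wanted = set(words)  # set membership is O(1) on average
    # Consume only the (possibly empty) prefix; the candidate word is
    # captured inside a lookahead so consecutive candidates are not skipped.
    # As in the original pattern, a word followed by '-' is rejected.
    token = re.compile(r"\b%s(?=(\w+)\b(?!-))" % re.escape(pre_pad))
    hits = []
    for m in token.finditer(text):
        word = m.group(1)
        if word in wanted:
            hits.append((word, m.start(), m.end(1)))
    return hits

text = "A was changed from B to C"
hits = find_keywords(text, ["C", "D"])
print(hits)
```

The pattern stays the same size no matter how many words there are, so the code size limit never comes into play, and each hit tells you which word matched. It also works with pre_pad = '', in which case every word in the text is checked against the set.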
Also, if profiling your code shows that chained regexps are faster, track the length of the regexp string while building it and split the full task into 2, 3, or 10 parts to avoid the overflow.
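A rough sketch of that splitting idea, assuming a string-length threshold is a good enough proxy for the compiled code size. The max_len value and the names compile_in_chunks and find_all are hypothetical, and the threshold would need tuning on the platform that overflows.

```python
import re

def compile_in_chunks(words, pre_pad="to ", max_len=10000):
    """Compile several smaller alternations instead of one huge pattern."""
    patterns, parts, size = [], [], 0
    for w in words:
        part = r"\b%s%s(?!-)(?!_)\b" % (re.escape(pre_pad), re.escape(w))
        # Start a new chunk before the joined string would exceed max_len.
        if parts and size + len(part) + 1 > max_len:
            patterns.append(re.compile("|".join(parts)))
            parts, size = [], 0
        parts.append(part)
        size += len(part) + 1  # +1 for the '|' separator
    if parts:
        patterns.append(re.compile("|".join(parts)))
    return patterns

def find_all(patterns, text, pre_pad="to "):
    """Run every chunk over text; return (word, start, end) tuples."""
    hits = []
    for p in patterns:
        for m in p.finditer(text):
            # Strip the prefix to recover which word matched.
            hits.append((m.group(0)[len(pre_pad):], m.start(), m.end()))
    return hits

# Build once, reuse for every entry. max_len is tiny here to force a split.
patterns = compile_in_chunks(["alpha", "beta", "gamma", "delta"], max_len=40)
text = "x was changed from beta to gamma"
hits = find_all(patterns, text)
print(len(patterns), hits)
```

The patterns are compiled once and reused for all entries, so the cost of building them is not paid per match. Hits come back grouped by chunk rather than in text order, so sort them by start position if the order matters.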
P.S.:
is more pythonic…