I’m trying to find certain keywords in a string with python. The string is

Asked: May 23, 2026

I’m trying to find certain keywords in a string with python. The string is something like this:

A was changed from B to C

All I’m trying to find is the “to C” part, where C is one of many thousand words.

This code builds the regexp string:

pre_pad = 'to '
regex_string = None
for i in words:
    if regex_string == None:
        regex_string = '\\b%s%s(?!-)(?!_)\\b' %(pre_pad, i)
    else:
        regex_string = regex_string + '|\\b%s%s(?!-)(?!_)\\b' %(pre_pad, i)

And later on I do:

matches = []
for match in re.finditer(r"%s" %regex_string, text):
    matches.append([match, MATCH_TYPE])

This code works on Linux but crashes on macOS with “Caught OverflowError while rendering: regular expression code size limit exceeded”.

I realize that regex_string is very long and that this is the cause of the problem:

print regex_string.__len__()
63574

How can I fix this so that it always works, independent of the number of words?

EDIT:

I forgot to mention that the pre_pad is sometimes empty: pre_pad = '', so searching for pre_pad first is not always possible.

In addition, the reason I build the entire regex_string first and then match it against the text is that I have to do this matching for many thousand entries. If I had to rebuild the regex_string for every entry, performance would be very poor.

Oh, and I need to know which word matches.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T00:16:30+00:00

A single huge regexp is not the right tool for this task, and you should not expect it to perform better than this:

import re

pre_pad = 'to '
matches = []

# One small pattern per word, so no single pattern can hit the size limit.
for i in words:
    regex_string = r'\b%s%s(?!-)(?!_)\b' % (pre_pad, i)
    for match in re.finditer(regex_string, text):
        matches.append([match, MATCH_TYPE])

Also, if profiling shows that chained (alternated) regexps really are faster, keep track of the regexp string's length while building it and split the full task into 2, 3, 10 or however many smaller patterns are needed to avoid the overflow.
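
As a rough sketch of that splitting idea (not a tested drop-in for your code): the helper names compile_chunks and find_matches and the chunk_size value are made up for illustration, and the sketch assumes every word is meant as literal text, so it passes each one through re.escape. The patterns are compiled once and can be reused for every entry, and the capturing group reports which word matched.

```python
import re

def compile_chunks(words, pre_pad='to ', chunk_size=500):
    """Compile the word list once, as several smaller patterns.

    chunk_size is a hypothetical tuning knob: lower it if a platform
    still reports that the code size limit is exceeded.
    """
    patterns = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        # Assumption: words are literal text, so escape regex metacharacters.
        alternation = '|'.join(re.escape(w) for w in chunk)
        # The prefix and the trailing checks are shared by the whole chunk;
        # group 1 holds whichever word matched.
        patterns.append(re.compile(
            r'\b%s(%s)(?!-)(?!_)\b' % (re.escape(pre_pad), alternation)))
    return patterns

def find_matches(patterns, text):
    """Return (match, word) pairs from all chunks, ordered by position."""
    found = []
    for pattern in patterns:
        for match in pattern.finditer(text):
            found.append((match, match.group(1)))
    found.sort(key=lambda pair: pair[0].start())
    return found
```

Call compile_chunks(words, pre_pad) once up front, then find_matches(patterns, text) for each entry. An empty pre_pad works the same way, since the prefix simply drops out of the pattern.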

P.S.:

print len(regex_string)

is more Pythonic than calling regex_string.__len__() directly.
