I started with the first code snippet below to search a list of lines and convert all keywords (identified in a separate list) in each line to lower case. For my test list of lines about 800 lines long, the keyword substitution for the entire list of lines takes less than a second as long as my keyword list is 100 items or fewer. When I extend the list to 101 items or more, the processing time jumps to over 9 seconds.
Using the second snippet (where all the patterns for the keyword list are compiled) drops the total processing time back down below 1 second.
Does anyone know why the processing time for the non-compiled substitution search is so sensitive to the number of items searched per input line? I’m surprised it jumps so sharply after 100 keywords.
snippet #1
for line in lines_in:
    for keyword in keywords:
        rawstring = r'\b' + keyword + r'\b'
        line = re.sub(rawstring, keyword, line, 0, re.IGNORECASE)
snippet #2
# compile one pattern per keyword up front
pattern = []
for i in range(len(keywords)):
    re_pattern = re.compile(r'\b' + keywords[i] + r'\b', re.IGNORECASE)
    pattern.append(re_pattern)
# reuse the precompiled patterns for every line
for line in lines_in:
    for i in range(len(keywords)):
        line = pattern[i].sub(keywords[i], line, 0)
This is because Python caches compiled regexes internally, and the size of that internal cache is 100 (as can be seen here on line 227). Furthermore, you can see on lines 246-247 that when the cache gets over the max size it is cleared outright rather than trimmed with a more advanced cache invalidation algorithm. This essentially means that, once you have more than 100 keywords, each iteration of your loop is blowing out the cache and causing all 100+ regexes to be recompiled.
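To see why one extra keyword makes such a difference, here is a toy model of a cache that is emptied wholesale when it fills up. It is only a sketch of the behavior described above, not the actual `re` module code: `count_compiles` and `MAX_CACHE` are names I made up for illustration, and the limit of 100 is assumed from the cache size mentioned above (it may differ in other Python versions).

```python
MAX_CACHE = 100  # assumed limit, matching the cache size described above


def count_compiles(num_keywords, num_lines, max_cache=MAX_CACHE):
    """Count cache misses ("compiles") for a clear-when-full cache (toy model)."""
    cache = set()
    compiles = 0
    for _ in range(num_lines):
        for keyword_id in range(num_keywords):
            if keyword_id in cache:
                continue  # cache hit: nothing to compile
            compiles += 1  # cache miss: the pattern would be compiled here
            if len(cache) >= max_cache:
                cache.clear()  # a full cache is emptied wholesale
            cache.add(keyword_id)
    return compiles


print(count_compiles(100, 800))  # 100
print(count_compiles(101, 800))  # 80800
```

With 100 keywords everything fits, so each pattern is compiled once. With 101, the cache is always cleared before a keyword comes around again, so in this model every single lookup is a miss.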
The performance is back to “normal” in your second example because it doesn’t rely on the internal cache staying intact to keep compiled regexes around.