From the Python docs for re.compile():
Note The compiled versions of the most recent patterns passed to
re.match(), re.search() or re.compile() are cached, so programs that
use only a few regular expressions at a time needn’t worry about
compiling regular expressions.
However, in my testing, this assertion doesn’t seem to hold up. When timing the following snippets that use the same pattern repeatedly, the compiled version is still substantially faster than the uncompiled one (which should supposedly be cached).
Is there something I am missing here that explains the time difference?
import timeit

# Setup shared by both timings; r is the precompiled pattern object.
setup = """
import re
pattern = "p.a.t.t.e.r.n"
target = "p1a2t3t4e5r6n"
r = re.compile(pattern)
"""

# Report the best of 3 runs of 5,000,000 calls each.
print("compiled: %s" %
      min(timeit.Timer("r.search(target)", setup).repeat(3, 5000000)))
print("uncompiled: %s" %
      min(timeit.Timer("re.search(pattern, target)", setup).repeat(3, 5000000)))
Results:
compiled: 2.26673030059
uncompiled: 6.15612802627
In the CPython implementation, re.search calls re._compile and then invokes .search on the result, and re.compile likewise relies on re._compile.
So you can see that as long as the regex is already in the cache dictionary, the only extra work involved is the lookup in the dictionary (which involves creating a few temporary tuples, a few extra function calls …).
Update
In the good ole’ days (the implementation referenced above), the cache used to be completely invalidated when it got too big. These days, the cache cycles, dropping the oldest items first. This implementation relies on the ordering of Python dictionaries (which was an implementation detail until Python 3.7). In CPython before Python 3.6, this would have dropped an arbitrary value out of the cache (which is arguably still better than invalidating the whole cache).
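The oldest-first eviction can be sketched as follows. This is an illustration of the idea under the assumption of Python 3.7+ (where dictionaries preserve insertion order), not the library's actual code; the helper name put and the maxsize of 3 are hypothetical.

```python
def put(cache, key, value, maxsize=3):
    # When the cache is full, drop the first key in iteration order.
    # On Python 3.7+ that is the oldest inserted entry.
    if len(cache) >= maxsize:
        del cache[next(iter(cache))]
    cache[key] = value


cache = {}
for name in ["a", "b", "c", "d"]:
    put(cache, name, name.upper())

print(list(cache))  # only the three most recently inserted keys remain
```

On an interpreter without ordered dictionaries, next(iter(cache)) would pick an arbitrary key, which matches the older behaviour described above.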