Each time a python file is imported that contains a large quantity of static regular expressions, cpu cycles are spent compiling the strings into their representative state machines in memory.
a = re.compile('a.*b') b = re.compile('c.*d') ...
Question: Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?
Pickling the object simply does the following, causing compilation to happen anyway:
>>> import pickle >>> import re >>> x = re.compile('.*') >>> pickle.dumps(x) 'cre\n_compile\np0\n(S'.*'\np1\nI0\ntp2\nRp3\n.'
And re objects are unmarshallable:
>>> import marshal >>> import re >>> x = re.compile('.*') >>> marshal.dumps(x) Traceback (most recent call last): File '<stdin>', line 1, in <module> ValueError: unmarshallable object
Not easily. You’d have to write a custom serializer that hooks into the C
sreimplementation of the Python regex engine. Any performance benefits would be vastly outweighed by the time and effort required.First, have you actually profiled the code? I doubt that compiling regexes is a significant part of the application’s run-time. Remember that they are only compiled the first time the module is imported in the current execution — thereafter, the module and its attributes are cached in memory.
If you have a program that basically spawns once, compiles a bunch of regexes, and then exits, you could try re-engineering it to perform multiple tests in one invocation. Then you could re-use the regexes, as above.
Finally, you could compile the regexes into C-based state machines and then link them in with an extension module. While this would likely be more difficult to maintain, it would eliminate regex compilation entirely from your application.