What is the fastest regular expression that never matches any string? It may seem useless, but consider, for instance, a program that takes a mandatory regex as a filter (this is actually my scenario). I’ve tried a few and found b(?<!b) to be the best performer, provided that b occurs rarely in the input.
Here is the Python code I wrote to test the speed of different patterns:
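To make the scenario concrete, here is a minimal sketch of such a filter. The function name `filter_lines` and its "exclude" semantics are assumptions for illustration only, not the actual program in question; the point is simply that a never-matching pattern turns a mandatory filter into a no-op.

```python
import re

def filter_lines(lines, exclude_pattern):
    """Hypothetical filter: keep only lines that do NOT match exclude_pattern."""
    exclude = re.compile(exclude_pattern)
    return [line for line in lines if not exclude.search(line)]

lines = ['alpha', 'bravo', 'charlie']

# A never-matching pattern excludes nothing, so every line is kept.
print(filter_lines(lines, r'b(?<!b)'))

# An ordinary pattern, for contrast: drops lines starting with "b".
print(filter_lines(lines, r'^b'))
```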
#!/usr/bin/env python
import re
import time

# Candidate patterns that can never match any string.
tests = [
    r'a\A',
    r'b\A',
    r'a^',
    r'b^',
    r'[^\s\S]',
    r'^(?<=a)',
    r'^(?<=b)',
    r'a(?<!a)',
    r'b(?<!b)',
    r'\Za',
    r'\Zb',
    r'$a',
    r'$b'
]

timing = []
# Test input: a long run of 'a', so 'a' is everywhere and 'b' never occurs.
text = 'a' * 50000000

for t in tests:
    pat = re.compile(t)
    start = time.time()
    pat.search(text)  # always fails; we only measure how long failing takes
    dur = time.time() - start
    timing.append((t, dur))

# Fastest pattern first.
timing.sort(key=lambda x: x[1])

print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
    print('%-30s %0.3f' % (t, dur))
On my machine, I get the following times:
Pattern                        Time
b(?<!b)                        0.043
b\A                            0.043
b^                             0.043
$a                             0.382
$b                             0.382
^(?<=a)                        0.395
\Za                            0.395
\Zb                            0.395
^(?<=b)                        0.414
a\A                            0.437
a^                             0.440
a(?<!a)                        0.796
[^\s\S]                        1.469
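The timings above each come from a single run measured with time.time(), so some of the small gaps may be noise. As a rough sketch (not the program used for the numbers above), the standard `timeit` module can repeat each search and keep the best time. The helper name `time_pattern` and the shorter test string are assumptions chosen to keep the example quick.

```python
import re
import timeit

def time_pattern(pattern, text, repeat=3):
    """Return the best-of-`repeat` time, in seconds, for one search of `text`."""
    compiled = re.compile(pattern)
    # number=1: each measurement is a single search; min() discards noisy runs.
    return min(timeit.repeat(lambda: compiled.search(text), number=1, repeat=repeat))

if __name__ == '__main__':
    text = 'a' * 1000000  # assumption: shorter than the original 50,000,000 characters
    for pattern in [r'b(?<!b)', r'a(?<!a)', r'[^\s\S]']:
        print('%-30s %0.4f' % (pattern, time_pattern(pattern, text)))
```

Absolute numbers will differ by machine and Python version; only the relative ordering is meaningful.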
Update: added benchmarks for some of the suggested regexes.
A single character is a valid regular expression. A single character that is not “magic” matches itself. If you can identify a single character that will never, ever appear in your text, you could make a pattern from that.
How about ASCII NUL, character 0?
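A minimal sketch of this idea, assuming the input is ordinary text that never contains NUL. The helper name `matches_anything` is made up for illustration. Note the caveat it demonstrates: a bare NUL pattern only "never matches" as long as that assumption about the input holds.

```python
import re

# A pattern consisting of the single character NUL (code point 0).
NUL_PATTERN = re.compile('\0')

def matches_anything(text):
    """Return True if the NUL pattern is found anywhere in text."""
    return NUL_PATTERN.search(text) is not None

print(matches_anything('ordinary text'))     # no NUL in the input, so no match
print(matches_anything('has a \0 inside'))   # the assumption is violated here
```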
I stuck in one more string in your test program, the string:
'\0'
It was about as fast as your best pattern:
b(?<!b)
Okay, you already have a character after the end of the string. How about a character before the start of the string? That’s impossible:
'x^'
Aha! That’s faster than checking for a character after end of string. But it’s about as fast as your best pattern.
I suggest replacing the b with an ASCII NUL and calling it good. When I tried that pattern:
\0(?<!\0)
It wins by a tiny fraction. But really, on my computer, all the ones discussed above are so close together that there isn’t much to distinguish them.
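As a quick sanity check, here is a small sketch confirming that these patterns fail on every input, including input that does contain the literal character. The sample strings are arbitrary choices for illustration. The lookbehind is what makes the pattern safe: after consuming a NUL, `(?<!\0)` requires that the previous character not be NUL, which is a contradiction.

```python
import re

# Patterns discussed above; each should fail on every possible input.
never_patterns = [r'\0(?<!\0)', r'b(?<!b)', r'x^', r'\Za']

# Arbitrary sample inputs, including ones containing NUL, 'b', 'x' and 'a'.
samples = ['', 'plain text', 'with \0 NUL', '\0\0\0', 'bbb', 'xa', 'a\nb']

for pattern in never_patterns:
    compiled = re.compile(pattern)
    for s in samples:
        assert compiled.search(s) is None, (pattern, s)

print('no pattern matched any sample')
```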
Results:
That was fun. Thanks for posting the question.
EDIT: Ah hah! I rewrote the program to test with real input data, and got a different result.
I downloaded “The Complete Works of William Shakespeare” from Project Gutenberg as a text file. (Weird, it gave an error on wget but let my browser get it… some sort of measure to protect against automated copying?) URL: http://www.gutenberg.org/cache/epub/100/pg100.txt
Here are the results, followed by the modified program as I ran it.
So yeah I’m definitely going with that first one.