I have a hierarchical data structure in Django, and I want to match the path to an object in a Django URL pattern. Here’s my pattern:
url(r'^products/(?P<path>(?:[-\w]+\/?)+)/$',
    CategoriesListView.as_view(model=Product),
    name='product_categories_list'
),
The goal is to match the whole path, but without the trailing slash. The problem is that on certain inputs the performance of this regex degrades badly. The main culprits seem to be strings that contain a dot:
In [1]: import re
In [2]: s = re.compile(r'^products/(?P<path>(?:[-\w]+/?)+)/$')
In [3]: s.search('products/111111111111111111111111.c')
This takes around 5 seconds. Making the string longer leads to exponential growth in run time.
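The growth comes from the nested quantifier. Because the slash in `[-\w]+/?` is optional, a run of n word characters can be divided between iterations of the outer `+` in roughly 2^(n-1) ways, and a backtracking engine tries them all before it gives up at the dot. The following is a minimal timing sketch of that effect; the helper name `time_failed_match` and the chosen lengths are only illustrative, and the absolute times will vary by machine.

```python
import re
import time

# The pattern from the question, unchanged.
pattern = re.compile(r'^products/(?P<path>(?:[-\w]+/?)+)/$')

def time_failed_match(n_digits):
    """Return (match, seconds) for an input that cannot match because of the dot."""
    s = 'products/' + '1' * n_digits + '.c'
    start = time.perf_counter()
    m = pattern.search(s)
    return m, time.perf_counter() - start

# Lengths are kept small on purpose: each extra character roughly doubles the work.
results = {n: time_failed_match(n) for n in (12, 14, 16, 18)}
for n, (m, secs) in results.items():
    print(n, m, f'{secs:.4f}s')
```

Every call returns `None`, and the time should grow by about a factor of four for every two characters added.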
How can I rewrite that regex so that it still matches on the same strings, but doesn’t eat my CPU for breakfast?
You could write:
which matches the same set of strings except that it doesn’t enforce the prohibition on nonterminal //. If that prohibition is important, then you could write:
which uses a negative lookahead assertion to enforce that prohibition in a less expensive way.
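As a sketch of what two such patterns might look like: the names `LOOSE` and `STRICT`, the helper `captured_path`, and both expressions below are assumptions made for illustration, not necessarily the exact patterns meant above. `LOOSE` replaces the nested quantifier with a single character class; `STRICT` consumes a slash only when no other slash follows it. In both, each input character can be consumed in just one way, so a failed match no longer explodes.

```python
import re

# Original pattern from the question; only run here on short inputs.
ORIGINAL = re.compile(r'^products/(?P<path>(?:[-\w]+/?)+)/$')

# Assumed loose variant: one character class, no nested quantifier.
# It accepts doubled slashes anywhere in the path.
LOOSE = re.compile(r'^products/(?P<path>[-\w/]+)/$')

# Assumed strict variant: the path may not start with a slash, and a slash
# is consumed only if the next character is not another slash.
STRICT = re.compile(r'^products/(?!/)(?P<path>(?:[-\w]|/(?!/))+)/$')

def captured_path(pattern, url):
    """Return the captured 'path' group, or None if the URL does not match."""
    m = pattern.search(url)
    return m.group('path') if m else None

# A much longer version of the problematic input now fails quickly.
bad = 'products/' + '1' * 1000 + '.c'
print(captured_path(LOOSE, bad), captured_path(STRICT, bad))

for url in ('products/a/b/', 'products/a//b/', 'products/a//'):
    print(url, [captured_path(p, url) for p in (ORIGINAL, LOOSE, STRICT)])
```

Note that this `STRICT` sketch is slightly tighter than the original: it also rejects a doubled slash at the very end (`products/a//`), which the original pattern accepts with `path` set to `a/`.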