I’ve read all related posts and scoured the internet but this is really beating me.
I have some text containing a date.
I would like to capture the date, but not if it’s preceded by a certain phrase.
A straightforward solution is to add a negative lookbehind to my RegEx.
Here are some examples (using findall).
I only want to capture the date if it isn’t preceded by the phrase “as of”.
19-2-11
something something 15-4-11
such and such as of 29-5-11
Here is my regular expression:
(?<!as of )(\d{1,2}-\d{1,2}-\d{2})
Expected results:
['19-2-11']
['15-4-11']
[]
Actual results:
['19-2-11']
['15-4-11']
['9-5-11']
Notice that’s 9 not 29. If I change \d{1,2} to something solid like \d{2} on the first pattern:
bad regex for testing: (?<!as of )(\d{2}-\d{1,2}-\d{2})
Then I get my expected results. Of course this is no good because I’d like to match 2-digit days as well as single-digit days.
Apparently my negative lookbehind is quite greedy, more so than my date capture, so it’s stealing a digit from it and failing. I’ve tried every means of correcting the greed I can think of, but I just don’t know how to fix this.
I’d like my date capture to match with the utmost greed, and then have my negative lookbehind applied. Is this possible? My problem seemed like a good use of negative lookbehinds and not overly complicated. I’m sure I could accomplish it another way if I must, but I’d like to learn how to do this.
How do I make Python’s negative lookbehind less greedy?
The cause is not that the lookbehind is greedy. This happens because the regex engine tries to match the pattern at every position it can.
It advances through the phrase
such and such as of 29-5-11
successfully matching (?<!as of ) at first, but failing to match \d{1,2}. Then the engine finds itself at this position (marked with !):
such and such as of !29-5-11
Here it fails to match (?<!as of ), so it advances to the next position:
such and such as of 2!9-5-11
There it successfully matches (?<!as of ), because the text just before that position is "s of 2" rather than "as of ", and then \d{1,2} matches the 9.
How to avoid it?
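The walk described above can be checked with a small sketch. It uses the compiled pattern's match(string, pos) method to attempt a match at three chosen positions. The variable names (date_start, attempts) are only illustrative.

```python
import re

pattern = re.compile(r"(?<!as of )(\d{1,2}-\d{1,2}-\d{2})")
text = "such and such as of 29-5-11"
date_start = text.index("29-5-11")  # index of the "2"

# Attempt a match starting exactly at the space, at the "2" and at the "9".
# Pattern.match(string, pos) starts at pos, but the lookbehind still sees
# the characters that come before pos.
attempts = {
    pos: pattern.match(text, pos)
    for pos in range(date_start - 1, date_start + 2)
}

for pos, m in attempts.items():
    print(pos, repr(text[pos]), m.group(1) if m else None)
```

Only the attempt that starts at the 9 succeeds, and that is the match findall reports.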
The general solution is to formulate the pattern as precisely as possible.
In this case I would require the date to be preceded by a space or by the beginning of the string.
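A minimal sketch of that idea follows. The exact pattern is an assumption reconstructed from the description: the lookbehind checks for "as of" without the trailing space, and the space (or the start of the string) is then matched explicitly by (?:^|\s).

```python
import re

# Assumed pattern: the lookbehind no longer includes the trailing space,
# and the date must directly follow a whitespace character or the start
# of the string.
pattern = re.compile(r"(?<!as of)(?:^|\s)(\d{1,2}-\d{1,2}-\d{2})")

samples = [
    "19-2-11",
    "something something 15-4-11",
    "such and such as of 29-5-11",
]

results = [pattern.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), found)
```

Because the date can no longer start in the middle of a number, the engine cannot skip the 2 and match 9-5-11.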
The solution of Mark Byers is also very good.
I think it’s very important to understand why the regex engine behaves this way and gives unwanted results.
By the way, the solution I gave above doesn’t work if there are 2 or more spaces.
It doesn’t work because the first position that matches is here
such and such as of ! 29-5-11
with the above-mentioned pattern.
What can be done to avoid it?
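To see the failure with two spaces, here is a short sketch. It assumes the single-space pattern is (?<!as of)(?:^|\s)(...), which is a reconstruction from the description above.

```python
import re

# Assumed single-space pattern (reconstructed from the description).
pattern = re.compile(r"(?<!as of)(?:^|\s)(\d{1,2}-\d{1,2}-\d{2})")

two_spaces = "such and such as of  29-5-11"  # note the two spaces after "of"

# At the second space the lookbehind sees "s of " rather than "as of",
# so it passes, \s consumes that space and the date is captured.
found = pattern.findall(two_spaces)
print(found)
```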
Unfortunately, lookbehind in Python’s regex engine doesn’t support the quantifiers + or * (a lookbehind must have a fixed width).
I think the simplest solution would be to make sure there are no spaces before (?:^|\s+), meaning that all the spaces are consumed by (?:^|\s+) straight after any non-space text. If that text is as of, the attempt fails at that position and the search starts all over again at the next position of the searched text.
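One way to express "no spaces before (?:^|\s+)" is a second, one-character negative lookbehind, (?<!\s). The full pattern below is an assumption built from that description, so treat it as a sketch rather than the definitive form.

```python
import re

# Assumed pattern: (?<!\s) forces the run of whitespace matched by \s+ to
# begin right after non-space text, and (?<!as of) rejects that text when
# it is "as of".
pattern = re.compile(r"(?<!as of)(?<!\s)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})")

samples = [
    "19-2-11",
    "something something   15-4-11",  # three spaces
    "such and such as of 29-5-11",    # one space
    "such and such as of   29-5-11",  # three spaces
]

results = [pattern.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), found)
```

Both lookbehinds have a fixed width, so Python's re module accepts them. Any amount of whitespace between "as of" and the date is handled by \s+.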