Let's say this is our text:
text = 'After 1992 , the winter and summer Olympics will be held two years apart , with the revised schedule beginning with the winter games in 1994 and the summer games in 1996 . ) Now , Mr. Pilson -- a former college basketball player who says a good negotiator needs `` a level of focus and intellectual attention similar to a good athlete-s is facing the consequences of his own aggressiveness . Next month , talks will begin on two coveted CBS contracts'
import re
print(re.search(r'(\w+ |\W+ ){0,4}1992( \W+| \w+){4}', text).group(0))
Output: After 1992 , the winter and
But this one gives me:
print(re.search(r'(\w+ |\W+ ){0,4}1992( \W+| \w+){0,4}', text).group(0))
Output: After 1992 ,
This seems strange to me: why isn't the second regex greedy?
This one is a bit stranger than the others:
print(re.search(r'(\w+ |\W+ ){0,4}summer( \W+| \w+){0,4}', text).group(0))
Output: , the winter and summer Olympics will be held
Questions
1- What is the difference between the first and the second one? To me, they should give the same text, because the only difference is {0,4} versus {4}: if {4} gives the long string, {0,4} should give the same string, because regex is greedy.
2- The problem may be related to punctuation, because the third example works the same with both {0,4} and {4}.
I am confused.
No mystery here.
In your second example, `␣\W+` overmatched `␣,␣` (the blank `␣` is also part of the `\W` class), so no subsequent matches were found for `␣\w+` against the remaining `the␣winter␣...`. But the `{0,4}` constraint was already satisfied, so there was no need for those further matches. So far so good.
Coming back to your first example, the match above did not satisfy `{4}`, so the engine kept looking. In the `␣\W+` match it backtracked over the last blank `␣`, so `␣\W+` only matched `␣,`; then 3 subsequent matches for `␣\w+` could be made against `␣the␣winter␣...`, and `{4}` was satisfied.
Change your regular expression to either `([^ ]+ +){0,4}my_word( +[^ ]+){0,4}` (this maintains the spirit of your original expression: treat spaces as separators and everything else, including punctuation, as words) or, maybe better, `(\w+\W+){0,4}my_word(\W+\w+){0,4}`, to isolate up to 4 actual words on either side irrespective of punctuation.
Later,
Aha. It matched `part` in `Department`.
`(^|(\w+\W+){1,5})\W*my_word\W*((\W+\w+){1,5}|$)`, this should isolate the word between separators and/or line ends.
`(\w+\W+){0,5}\w*my_word\w*(\W*\w+){0,5}`
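If you want to watch the backtracking described above happen, here is a minimal Python 3 sketch. It makes a few assumptions that are not in the original posts: the text is shortened to its first sentence fragment, `show` is a made-up helper, `part` stands in for `my_word`, and the `Department` sentence is invented for illustration.

```python
import re

# Shortened copy of the question's text (assumption: the tail is not needed here).
text = "After 1992 , the winter and summer Olympics will be held two years apart"


def show(pattern, s):
    """Hypothetical helper: return the whole match of pattern in s, or None."""
    m = re.search(pattern, s)
    return m.group(0) if m else None


# {0,4}: " \W+" takes " , " (blank, comma, blank); no further iteration can
# start with a blank, and one iteration already satisfies {0,4}.
optional = show(r'(\w+ |\W+ ){0,4}1992( \W+| \w+){0,4}', text)
print(repr(optional))   # 'After 1992 , '

# {4}: one iteration is not enough, so " \W+" gives back the trailing blank
# and " \w+" can then match " the", " winter" and " and".
required = show(r'(\w+ |\W+ ){0,4}1992( \W+| \w+){4}', text)
print(repr(required))   # 'After 1992 , the winter and'

# Suggested fixes: greedy {0,4} now behaves as expected.
by_spaces = show(r'([^ ]+ +){0,4}1992( +[^ ]+){0,4}', text)
print(repr(by_spaces))  # 'After 1992 , the winter and'
by_words = show(r'(\w+\W+){0,4}1992(\W+\w+){0,4}', text)
print(repr(by_words))   # 'After 1992 , the winter and summer'

# The "part in Department" problem, on a made-up sentence.
sample = "the Department said"
loose = show(r'(\w+\W+){0,4}part(\W+\w+){0,4}', sample)
print(repr(loose))      # 'part' (found inside "Department")
strict = show(r'(^|(\w+\W+){1,5})\W*part\W*((\W+\w+){1,5}|$)', sample)
print(repr(strict))     # None
```

Printing with `repr` makes the trailing blank in the `{0,4}` result visible, which is the whole point: the quantifier was greedy, but the single iteration it managed to make had already consumed the blank that the next iteration needed.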