I have a string for which I’m trying to create a regex mask that shows N words, given an offset. Let’s say I have the following string:
"The quick, brown fox jumps over the lazy dog."
I want to show 3 words at a time:
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."
I’m using Python and I’ve been using the following simple regex to detect 3 words:
>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '
But I can’t figure out how to have a kind of mask to show the next 3 words and not the beginning ones. I need to keep punctuation.
The prefix-matching option
You can make this work by having a variable-prefix regex to skip the first offset words, and capturing the word triplet into a group. So something like this:
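A minimal sketch of the idea, not necessarily the code originally posted here. It assumes the question's own \w+\W* word pattern; the names WORD and show_words are hypothetical, and the trailing whitespace picked up by \W* is stripped so the output matches the examples in the question.

```python
import re

# The question's own "word" pattern: word characters plus any trailing
# punctuation and whitespace.
WORD = r'\w+\W*'

def show_words(text, offset, count=3):
    """Skip `offset` words, then return the next `count` words (or None)."""
    # For offset=2, count=3 this builds: (?:\w+\W*){2}((?:\w+\W*){3})
    pattern = r'(?:%s){%d}((?:%s){%d})' % (WORD, offset, WORD, count)
    m = re.match(pattern, text)
    # rstrip() drops the whitespace that \W* picks up after the last word
    return m.group(1).rstrip() if m else None

s = "The quick, brown fox jumps over the lazy dog."
for offset in range(7):
    print("offset %d: %r" % (offset, show_words(s, offset)))
```

Building the pattern per call keeps the prefix length variable; only group 1 is returned, so the skipped words never show up in the result.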
Let’s take a look at the pattern. This does what it says: match 2 words, then, capturing into group 1, match 3 words. The (?:...) constructs are used for grouping for the repetition, but they’re non-capturing.
Note on “word” pattern
It should be said that \w+\W* is a poor choice for a “word” pattern: given an input in which there are no 3 words, the regex is able to match anyway, because \W* allows for an empty string match. Perhaps a better pattern is something like:
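A small sketch of both points. The input "nothing" is a made-up test case, and the stricter pattern \w+(?:\W+|$) is an assumption: one pattern that requires every word to end in non-word characters or at the end of the string. The names LOOSE and STRICT are illustrative only.

```python
import re

LOOSE = r'(?:\w+\W*){3}'          # three "words" using the question's pattern
STRICT = r'(?:\w+(?:\W+|$)){3}'   # each word must end in \W+ or at end of string

one_word = "nothing"  # made-up input that clearly has fewer than 3 words

loose_match = re.search(LOOSE, one_word)
strict_match = re.search(STRICT, one_word)

# \W* may match nothing, so backtracking splits one word into three pieces
print(loose_match.group() if loose_match else None)
# the stricter pattern finds no such split
print(strict_match)
```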
That is, a \w+ that is followed by either a \W+ or the end of the string $.
The capturing lookahead option
As suggested by Kobi in a comment, this option is simpler in that you only have one static pattern. It uses findall to capture all matches. How this works is that it matches on the zero-width word boundary \b, using lookahead to capture 3 “words” in group 1.
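The lookahead approach can be sketched as follows. This is an illustration rather than the exact pattern from the comment: as an assumption, each word is written as \w+(?:\W+|$) (a \w+ followed by non-word characters or the end of the string) so that no window with fewer than 3 words is reported near the end of the string. The name WINDOWS is hypothetical.

```python
import re

s = "The quick, brown fox jumps over the lazy dog."

# \b anchors each attempt at a word boundary; the lookahead consumes nothing,
# so the next attempt can start at the very next word.
# Group 1 holds three "words", each a \w+ followed by \W+ or end of string.
WINDOWS = re.compile(r'\b(?=((?:\w+(?:\W+|$)){3}))')

# findall returns the contents of group 1 for every successful position
windows = [w.rstrip() for w in WINDOWS.findall(s)]
for offset, w in enumerate(windows):
    print("offset %d: %r" % (offset, w))
```

Because the list is computed once, showing a given offset is then just an index into windows.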