I have just gone through the concept of Zero-Width Assertions in the documentation, and some quick questions came to mind:
- Why the name `Zero-Width Assertions`?
- How do the `look-ahead` and `look-behind` concepts support the `Zero-Width Assertions` concept?
- What are the four symbols `?<=s`, `<!s`, `=s`, `<=s` instructing inside the pattern? Can you help me understand what is actually going on?
I also tried some small snippets to understand the logic, but I am not very confident about their output:
irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"
irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"
irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"
Can anyone help me understand this?
EDIT
Here I have tried two snippets, one with the “Zero-Width Assertions” concept, as below:
irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
and the other without the “Zero-Width Assertions” concept, as below:
irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"
Both of the above produce the same output. How does each regexp proceed internally to produce it? Could you help me visualize this?
Thanks
Regular expressions match from left to right, and move a sort of “cursor” along the string as they go. If your regex contains a regular character like `a`, this means: “if there’s a letter `a` in front of the cursor, move the cursor ahead one character, and keep going. Otherwise, something’s wrong; back up and try something else.” So you might say that `a` has a “width” of one character.

A “zero-width assertion” is just that: it asserts something about the string (i.e., doesn’t match if some condition doesn’t hold), but it doesn’t move the cursor forwards, because its “width” is zero.
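To make the “width” idea concrete, here is a minimal Ruby sketch. `match_width` is a hypothetical helper introduced only for this illustration; it measures how far the cursor travelled by subtracting the match's start offset from its end offset.

```ruby
# Hypothetical helper: how many characters did the match consume?
# Returns nil when the pattern does not match at all.
def match_width(pattern, string)
  m = pattern.match(string)
  m && (m.end(0) - m.begin(0))
end

literal_width   = match_width(/a/, "cat")      # a literal character consumes one character
lookahead_width = match_width(/(?=a)/, "cat")  # a lookahead only checks, consuming nothing

puts "literal: #{literal_width}, lookahead: #{lookahead_width}"
```

Both patterns succeed at the same place in `"cat"`; the only difference is whether the cursor moves.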
You’re probably already familiar with some simpler zero-width assertions, like `^` and `$`. These match the start and end of a string (in Ruby, of each line; `\A` and `\z` are the whole-string anchors). If the cursor isn’t at the start or end when it sees those symbols, the regex engine will fail, back up, and try something else. But they don’t actually move the cursor forwards, because they don’t match characters; they only check where the cursor is.

Lookahead and lookbehind work the same way. When the regex engine tries to match them, it checks around the cursor to see if the right pattern is ahead of or behind it, but in case of a match, it doesn’t move the cursor.
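As a rough illustration of “they only check where the cursor is”, this sketch matches an anchor or a lookbehind on its own and prints what was matched. The strings `"foo"` and `"foobar"` are arbitrary examples chosen here, not anything from the question.

```ruby
# Each of these patterns is a pure position check, so the matched text is empty.
start_anchor = /^/.match("foo")            # succeeds at the start of the line
end_anchor   = /$/.match("foo")            # succeeds at the end of the line
behind       = /(?<=foo)/.match("foobar")  # succeeds right after "foo"

[start_anchor, end_anchor, behind].each do |m|
  # m[0] is the matched text; m.begin(0) is the cursor position of the match.
  puts "matched #{m[0].inspect} at position #{m.begin(0)}"
end
```

In every case the match is the empty string: the engine found a position that satisfies the condition, but consumed no characters.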
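Consider the pattern `(?=foo)foo` applied to the string `foo`.

This will match! The regex engine goes like this:

- Start at the beginning of the string: `|foo`.
- The first part of the regex is `(?=foo)`. This means: only match if `foo` appears after the cursor. Does it? Well, yes, so we can proceed. But the cursor doesn’t move, because this is zero-width. We still have `|foo`.
- Next is `f`. Is there an `f` in front of the cursor? Yes, so proceed, and move the cursor past the `f`: `f|oo`.
- Next is `o`. Is there an `o` in front of the cursor? Yes, so proceed, and move the cursor past the `o`: `fo|o`.
- The same happens for the final `o`, leaving `foo|`.
- The pattern is exhausted and nothing failed, so the match succeeds.

On your four assertions in particular: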
`(?=...)` is “lookahead”; it asserts that `...` does appear after the cursor. For example, with the pattern `ju(?=m)` on the string “jump june”, the “ju” in “jump” matches because an “m” comes next. But the “ju” in “june” doesn’t have an “m” next, so it’s left alone.
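A small sketch of the lookahead case in Ruby. The replacement text `"slu"` is just an assumption picked to make the result readable; any replacement would show the same behaviour.

```ruby
# Replace "ju" only where an "m" follows; the "m" itself is not part of the match.
lookahead_result = "jump june".gsub(/ju(?=m)/, "slu")
puts lookahead_result  # => slump june
```

Note that the `m` survives the replacement: it was checked, not consumed.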
Since it doesn’t move the cursor, you have to be careful when putting anything after it. `(?=a)b` will never match anything, because it checks that the next character is `a`, then also checks that the same character is `b`, which is impossible.

`(?<=...)` is “lookbehind”; it asserts that `...` does appear before the cursor. For example, with the pattern `(?<=f)our` on the string “four flour”, the “our” in “four” matches because there’s an “f” immediately before it, but the “our” in “flour” has an “l” immediately before it so it doesn’t match.
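Here is a short sketch of the lookbehind case, plus a quick check of the contradictory `(?=a)b` pattern. The replacement `"ive"` and the test strings are illustrative choices.

```ruby
# Replace "our" only where an "f" sits immediately before it.
lookbehind_result = "four flour".gsub(/(?<=f)our/, "ive")
puts lookbehind_result  # => five flour

# (?=a)b demands that the same character be both "a" and "b", so it never matches.
impossible = ["ab", "ba", "a", "b"].map { |s| /(?=a)b/.match(s) }
puts impossible.all?(&:nil?)  # => true
```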
Like above, you have to be careful with what you put before it. `a(?<=b)` will never match, because it checks that the next character is `a`, moves the cursor, then checks that the previous character was `b`.

`(?!...)` is “negative lookahead”; it asserts that `...` does not appear after the cursor. For example, with the pattern `child(?!ren)` on the string “child children”, the standalone “child” matches, because what comes next is a space, not “ren”. The “child” in “children” doesn’t.
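The negative lookahead case might look like this in Ruby, together with a check of the `a(?<=b)` pattern described above. The replacement `"kid"` is an arbitrary choice for illustration.

```ruby
# Replace "child" only where it is NOT followed by "ren".
negative_lookahead_result = "child children".gsub(/child(?!ren)/, "kid")
puts negative_lookahead_result  # => kid children

# a(?<=b): after consuming "a", the previous character is "a", never "b".
never_matches = ["ab", "ba", "aa"].map { |s| /a(?<=b)/.match(s) }
puts never_matches.all?(&:nil?)  # => true
```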
This is probably the one I get the most use out of; finely controlling what can’t come next comes in handy.

`(?<!...)` is “negative lookbehind”; it asserts that `...` does not appear before the cursor. For example, with the pattern `(?<!r)oot` on the string “foot root”, the “oot” in “foot” is fine, since there’s no “r” before it. The “oot” in “root” clearly has an “r”.
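And a matching sketch for negative lookbehind; again the replacement `"eet"` is only an assumption made for the example.

```ruby
# Replace "oot" only where it is NOT preceded by "r".
negative_lookbehind_result = "foot root".gsub(/(?<!r)oot/, "eet")
puts negative_lookbehind_result  # => feet root
```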
As an additional restriction, most regex engines require that `...` has a fixed length inside a lookbehind (positive or negative). So you can’t use `?`, `+`, `*`, or `{n,m}` there.

You can also nest these and otherwise do all kinds of crazy things. I use them mainly for one-offs I know I’ll never have to maintain, so I don’t have any great examples of real-world applications handy; honestly, they’re weird enough that you should try to do what you want some other way first. 🙂
Afterthought: The syntax comes from Perl regular expressions, which used `(?` followed by various symbols for a lot of extended syntax, because a `?` directly after `(` is otherwise invalid. So `<=` doesn’t mean anything by itself; `(?<=` is one entire token, meaning “this is the start of a lookbehind”. It’s like how `+=` and `++` are separate operators, even though they both start with `+`.

They’re easy to remember, though: `=` indicates looking forwards (or, really, “here”), `<` indicates looking backwards, and `!` has its traditional meaning of “not”.

Regarding your later examples (`(?!s)ight` versus plain `ight`):
Yes, these produce the same output. This is that tricky bit with using lookahead:

- After failing at the earlier positions, the engine reaches `fores|ight`.
- It checks `(?!s)`. Is the character after the cursor `s`? No, it’s `i`! So that part matches and the matching continues, but the cursor doesn’t move, and we still have `fores|ight`.
- It checks `ight`. Does `ight` come after the cursor? Well, yes, it does, so move the cursor: `foresight|`.

The cursor moved over the substring `ight`, so that’s the full match, and that’s what gets replaced.

Doing `(?!a)b` is useless, since you’re saying: the next character must not be `a`, and it must be `b`. But that’s the same as just matching `b`!

This can be useful sometimes, but you need a more complex pattern: for example, `(?!3)\d` will match any digit that isn’t a 3.

This is what you want: `(?<!s)ight`.
This asserts that `s` doesn’t come before `ight`.