Hi, I’m going through regular expressions, but I’m confused about metacharacters, particularly ‘*’ and ‘?’.
‘*’ is supposed to match the preceding character 0 or more times.
For example, ‘ta*k’ supposedly matches ‘tak’ and ‘tk’.
But I wouldn’t have thought this to be true at all – here’s my reasoning:
for tak:
regexp: I need a ‘t’
string: I have ‘t’
regexp: okay, your next character needs to be an ‘a’
string: yes it is
regexp: okay, keep giving me characters until your character isn’t an ‘a’
string: okay. I’ve just given you ‘k’
regexp: okay, your next character needs to be a ‘k’
string: I don’t have any more characters left!
regexp: fail
for tk:
regexp: I need a ‘t’
string: I have ‘t’
regexp: okay, your next character needs to be an ‘a’
string: no, it’s a ‘k’
regexp: fail
Can someone clarify for me why ‘ta*k’ matches both ‘tak’ and ‘tk’?
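‘*’ does not mean to match a character zero or more times, but an atom zero or more times. A single character is an atom, but so is any grouping. And ‘*’ means zero or more, so zero occurrences are fine too.
When the regex cursor has “swallowed” the ‘t’, the positions are: the regex is just before ‘a*’, and the input is just before ‘a’.
The regex engine then tries to eat as many ‘a’s as possible. Here there is one. After it has swallowed it, the positions are: the regex is just before ‘k’, and the input is just before ‘k’. The ‘k’ that ends the run of ‘a’s is only looked at, not consumed, so it is still there for the ‘k’ in the regex.
Then the ‘k’ is swallowed: end of regex, match. Note that the string may have other characters after the match; the regex engine doesn’t care: it has a match.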
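The walk-through above can be checked quickly in Python. This is only a small sketch using the standard `re` module; the input strings ‘taken’ and ‘taaak’ are made up for illustration and are not from the question.

```python
import re

pattern = re.compile(r"ta*k")

# 't', then one 'a' swallowed by 'a*', then 'k'.
m = pattern.search("tak")
assert m is not None and m.group(0) == "tak"

# Characters after the match are ignored by search(): the engine already has a match.
m = pattern.search("taken")
assert m is not None and m.span() == (0, 3)

# 'a*' is greedy: it swallows as many 'a's as it can before 'k' is tried.
m = pattern.search("taaak")
assert m is not None and m.group(0) == "taaak"

print("all three inputs match")
```

Note that `search()` is used rather than `fullmatch()`, because the point is that trailing characters do not prevent a match.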
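In the case where the string is ‘tk’, before ‘a*’ the positions are: the regex is just before ‘a*’, and the input is just before ‘k’.
But ‘*’ can match an empty set of ‘a’s, therefore ‘a*’ is satisfied! Which means the positions then become: the regex is just before ‘k’, and the input is still just before ‘k’. Rinse, repeat.
Now, let’s take ‘taak’ as an input and ‘ta?k’ as a regex: this will fail, but let’s see how… The ‘t’ is swallowed, then ‘a?’ swallows one ‘a’. The regex now wants a ‘k’, but the input has a second ‘a’. The engine backtracks and lets ‘a?’ match nothing, but the regex still wants a ‘k’ where the input has an ‘a’. No other position in the input starts with ‘t’, so the match fails only after all of these attempts.
Which is why it is VERY important to make regexes fail FAST.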
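To make the backtracking visible, here is a toy tracer. It is a rough sketch, not how any real engine is implemented: it only understands literal characters optionally followed by ‘*’ or ‘?’, and the helper names (`parse`, `match_here`, `trace_search`) are made up for this example. It tries the greedy choice first, backs off one character at a time, and logs each attempt.

```python
import re


def parse(pattern):
    """Split a pattern into (char, quantifier) atoms; quantifier is '', '*' or '?'."""
    atoms = []
    i = 0
    while i < len(pattern):
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in "*?":
            quant = pattern[i + 1]
        atoms.append((pattern[i], quant))
        i += 1 + len(quant)
    return atoms


def match_here(atoms, text, pos, log):
    """Try to match all atoms starting at text[pos]; return the end index or None."""
    if not atoms:
        return pos  # end of regex: match
    (ch, quant), rest = atoms[0], atoms[1:]
    min_n, max_n = {"": (1, 1), "?": (0, 1), "*": (0, len(text))}[quant]
    # Greedy: count how many copies of ch are available, up to max_n.
    n = 0
    while n < max_n and pos + n < len(text) and text[pos + n] == ch:
        n += 1
    # Try the longest run first, then backtrack by giving characters back.
    while n >= min_n:
        log.append(f"'{ch}{quant}' takes {n} char(s) at position {pos}")
        end = match_here(rest, text, pos + n, log)
        if end is not None:
            return end
        n -= 1
    return None


def trace_search(pattern, text):
    """Like re.search for the toy syntax; returns (span or None, list of attempts)."""
    atoms = parse(pattern)
    log = []
    for start in range(len(text) + 1):
        end = match_here(atoms, text, start, log)
        if end is not None:
            return (start, end), log
    return None, log


for pattern, text in [("ta*k", "tk"), ("ta?k", "taak")]:
    span, log = trace_search(pattern, text)
    print(f"{pattern} on {text}: {span}")
    for step in log:
        print("   ", step)
    # Cross-check the toy result against Python's real engine.
    real = re.search(pattern, text)
    assert (real.span() if real else None) == span
```

For ‘ta?k’ on ‘taak’ the log shows ‘a?’ being tried with one ‘a’ and then with none before the engine gives up. With longer inputs and nested quantifiers, the number of such retries can grow very quickly, which is the reason for wanting regexes to fail fast.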