I have to write a regular expression that gets every sequence of three words from a text. Words are separated by a single space. The code I wrote does not return all of the sequences.
For example, for the text “one two three four five six” I get only two sequences: 1. one two three 2. four five six. But I want my regular expression to give me all sequences, so the output would be: 1. one two three 2. two three four 3. three four five 4. four five six.
Can somebody please tell me what’s wrong with my regular expression?
Here is my code:
using System;
using System.Text.RegularExpressions;

string input = "one two three four five six";
string pattern = @"([a-zA-Z]+ ){2}[a-zA-Z]+";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
    Console.WriteLine("{0} ({1} matches):", input, matches.Count);
    Console.WriteLine();
    foreach (Match match in matches)
        Console.WriteLine(match.Value);
}
Console.ReadLine();
There’s nothing wrong with your regular expression – it’s just how regular expressions work. When you find a match, the search for the next match continues at the end of the one you just found – the width of the match is consumed.
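To see the consumption for yourself, here is a small sketch that prints where each match starts and ends. It reuses the input and pattern from the question; the output format is just something I picked for illustration, and it assumes a compiler that supports top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string input = "one two three four five six";
string pattern = @"([a-zA-Z]+ ){2}[a-zA-Z]+";

// Print the half-open character range [start, end) covered by each match.
// The second match can only begin after the first one has ended.
foreach (Match match in Regex.Matches(input, pattern))
{
    int end = match.Index + match.Length;
    Console.WriteLine($"[{match.Index}, {end}) {match.Value}");
}
```

The first match covers characters 0 to 12, so the search resumes at the space at index 13, and the next possible start is "four" at index 14. The words "two" and "three" have already been consumed and can't begin a new match.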
So, how to fix this? One way is to make your match not consume anything. You can do this by placing your original pattern in a zero-width positive lookahead assertion:
(?=pattern) says “only match at this point if it’s immediately followed by something matching pattern”, but the content matching pattern isn’t part of the overall match, so it isn’t consumed.
If it’s not part of the match, though, it doesn’t appear in match.Value, so how do you get the value out? Simple: just add a capturing group around the original pattern (i.e. (?=(pattern))), and the captured group will appear in your results as normal.
So now you can go through your foreach loop as before, but match.Value will be empty; your desired result is in match.Groups[1].Value.
But now you have another problem. Your results are “one two three”, “ne two three”, “e two three”, “two three four”, and so on. This is because your pattern matches even when you start halfway through a word.
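A quick sketch of this intermediate step, assuming the same input as in the question and a compiler with top-level statements (C# 9 or later). The original pattern is wrapped in a capturing group inside a lookahead, and the start index is printed so the mid-word matches are easy to spot.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "one two three four five six";
// Lookahead around a capturing group that holds the original pattern.
string pattern = @"(?=(([a-zA-Z]+ ){2}[a-zA-Z]+))";

foreach (Match match in Regex.Matches(input, pattern))
{
    // match.Value is empty because the match is zero-width;
    // group 1 (the outer group) holds the three words.
    Console.WriteLine($"{match.Index}: '{match.Groups[1].Value}'");
}
```

Every letter that still has two more words after its own word produces a match, which is why the partial words show up.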
How to fix this?
We add another zero-width assertion, this time a negative lookbehind:
(?<![a-zA-Z])
Rather than saying “only match if this point is followed by the pattern”, it says “never match if this point is preceded by the pattern”. Thus we’ll never match at a point preceded by a letter. ne two three isn’t returned, for example, as it’s preceded by o.
With this pattern, you finally get your expected results.
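Putting the two assertions together, the whole thing might look like the sketch below. It keeps the input from the question and assumes top-level statements (C# 9 or later); the combined pattern string is my assembly of the pieces described above.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "one two three four five six";
// Negative lookbehind: don't start in the middle of a word.
// Positive lookahead with a capturing group: match three words without consuming them.
string pattern = @"(?<![a-zA-Z])(?=(([a-zA-Z]+ ){2}[a-zA-Z]+))";

MatchCollection matches = Regex.Matches(input, pattern);
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
{
    // The matched text lives in group 1, not in match.Value.
    Console.WriteLine(match.Groups[1].Value);
}
```

For the sample input this prints the four sequences asked for: one two three, two three four, three four five, four five six.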