I have a large regular expression that I use to parse my own file format, which is similar to Lua. It works fine, except that numbers inside quotes somehow get matched twice, even though Regex.Split shouldn’t return overlapping results. I’ve simplified it down to this console app. Any ideas?
    static void Main(string[] args)
    {
        string pattern = "(\r\n)|(\"(.*)\")"; // Splits at \r\n and anything in "quotes"
        string input = "\"01\"\r\n" + // "01"
                       "\"02\"\r\n" + // "02"
                       "\"03\"\r\n";  // "03"
        string[] results = Regex.Split(input, pattern);
        foreach (string result in results)
        {
            // This just filters out the split \r\n and empty strings in results
            if (string.IsNullOrWhiteSpace(result) == false)
                Console.WriteLine(result);
        }
        Console.ReadLine();
    }
Returns:
"01"
01
"02"
02
"03"
03
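From the documentation: if capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array.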
You have two sets of capturing parentheses around the quoted text, one inclusive of the quotes and one exclusive. Each contributes its own string to the result, which is why you see every value twice.
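As a rough illustration of that behaviour, the sketch below runs the question's input through the original pattern and through a variant whose inner group is made non-capturing with `(?:...)`. The class name `SplitCaptureDemo` and the helper `SplitNonBlank` are made up for this example; only `Regex.Split` and the pattern come from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SplitCaptureDemo
{
    // Hypothetical helper: splits, then drops empty and whitespace-only
    // pieces, mirroring the filter in the question's foreach loop.
    public static string[] SplitNonBlank(string input, string pattern) =>
        Regex.Split(input, pattern)
             .Where(s => !string.IsNullOrWhiteSpace(s))
             .ToArray();

    static void Main()
    {
        string input = "\"01\"\r\n\"02\"\r\n\"03\"\r\n";

        // Original pattern: group 2 captures "01" with quotes,
        // group 3 captures 01 without them, and Split returns both.
        string twoGroups = "(\r\n)|(\"(.*)\")";

        // Same pattern, but the inner group no longer captures.
        string oneGroup = "(\r\n)|(\"(?:.*)\")";

        // Prints: "01" 01 "02" 02 "03" 03
        Console.WriteLine(string.Join(" ", SplitNonBlank(input, twoGroups)));

        // Prints: "01" "02" "03"
        Console.WriteLine(string.Join(" ", SplitNonBlank(input, oneGroup)));
    }
}
```

This only removes the duplicate; the quoted strings still come back because the outer group captures them, not because they are pieces left between delimiters.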
Note that the pattern for Regex.Split isn’t supposed to match the desired results, it’s supposed to match the delimiters. A quoted string is usually not a delimiter.
Also, you’ve used a greedy match, which would normally run from the first quote to the last one. You only get one result per line here because `.` does not match `\n` by default, so each `.*` stops at the end of its line. Two quoted strings on the same line would come back as a single match.
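If the goal is to collect the quoted values rather than to cut the input at delimiters, one option is `Regex.Matches` with a pattern that describes the token itself. The sketch below is only an illustration: `QuotedTokenDemo` and `QuotedValues` are invented names, and it assumes the format has no escaped quotes inside strings. Note that `[^"]*` will also match across line breaks, which may or may not be what your format wants.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuotedTokenDemo
{
    // Hypothetical helper: returns the text between each pair of double quotes.
    // [^"]* cannot run past a closing quote, so greediness is not a problem.
    public static List<string> QuotedValues(string input)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, "\"([^\"]*)\""))
            values.Add(m.Groups[1].Value);
        return values;
    }

    static void Main()
    {
        // Greedy .* spans from the first quote to the last one on the line.
        // Prints: "a" "b"
        Console.WriteLine(Regex.Match("\"a\" \"b\"", "\".*\"").Value);

        // Prints 01, 02 and 03 on separate lines, even though
        // 02 and 03 share a line in the input.
        foreach (string v in QuotedValues("\"01\"\r\n\"02\" \"03\"\r\n"))
            Console.WriteLine(v);
    }
}
```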
Overall, I’d say you’re using the wrong tool. Depending on the implementation, regular expressions either cannot deal with nested groupings at all or do so very inefficiently. A simple DFA should work much better and never needs more than a single scan.
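To give an idea of what such a single-scan approach could look like, here is a minimal two-state scanner for the quoted strings in the question's sample input. It is a sketch under several assumptions and not a parser for your actual format: `QuoteScanner` and `Scan` are invented names, escaped quotes are not handled, everything outside quotes is ignored, and an unterminated string is silently dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuoteScanner
{
    // Two states: outside a quoted string, or inside one.
    // Each input character is looked at exactly once.
    public static List<string> Scan(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in input)
        {
            if (c == '"')
            {
                if (inQuotes)
                {
                    // Closing quote: emit the collected token.
                    tokens.Add(current.ToString());
                    current.Clear();
                }
                inQuotes = !inQuotes;
            }
            else if (inQuotes)
            {
                current.Append(c);
            }
            // Characters outside quotes (such as \r\n) are skipped.
        }
        return tokens;
    }

    static void Main()
    {
        // Prints 01, 02 and 03 on separate lines.
        foreach (string token in Scan("\"01\"\r\n\"02\"\r\n\"03\"\r\n"))
            Console.WriteLine(token);
    }
}
```

Adding more token kinds means adding more states or branches, but the input is still read only once from start to end.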