I’ve been struggling for over a day with what I’d thought would be an easy thing.
I need to parse a page’s HTML to find some structured data.
Here’s the test string:
<option value="0794">0794 - SANTA MARIA</option>
<option value="0795">0795 - ALICE COUTINHO</option>
<option value="0800">0800 - T.LARANJEIRAS (CIRCULAR A E B) - VIA T. CARAPINA/J. CAMBURI</option>
<option value="0801">0801 - T. LARANJEIRAS / T. CARAPINA - VIA VALPARAISO / J. LIMOEIRO</option>
<option value="0802">0802 - DIVINOPOLIS / T.LARANJEIRAS VIA CENTRO DA SERRA</option>
And here’s the Regex pattern:
^\s+<option value="\d+">(?<linha>\d+) - (?<nome>(.*?))</option>$
When debugging with Visual Studio 2010, it finds no matches.
Full code:
var pattern = @"^\s+<option value=""\d+"">(?<linha>\d+) - (?<nome>(.*?))</option>$";
var regex = new Regex(pattern, RegexOptions.Multiline);
var matches = regex.Matches(html);
html is the test string and matches.Count is always 0.
I’ve already tested on http://regexhero.net/tester/ and on http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx and it works perfectly.
Any help would be appreciated.
I see two problems. First, there’s the ^\s+ at the beginning of the regex. In Multiline mode, ^ matches the position following a linefeed, and \s+ matches one or more whitespace characters. But there aren’t any whitespace characters after the linefeeds. If you think there might be space or tab characters at the beginning of the line, you should change the + to *; otherwise, just drop the \s+.
Second, the regex ends with $, which matches just before a linefeed. But when I copied the text from your post, the lines ended with \r\n (carriage-return + linefeed), and you aren’t accounting for the \r.
When I change the ^\s+ to ^ and the $ to \r?$, I get five matches. By the way, the second problem is .NET’s fault, not yours; $ in multiline mode should match before \r, as detailed here.