I have some website source stream I am trying to parse. My current Regex is this:
Regex pattern = new Regex (
@"<a\b # Begin start tag
[^>]+? # Lazily consume up to id attribute
id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]? # $1: id
[^>]+? # Lazily consume up to href attribute
href\s*=\s*['""]?([^>\s'""]+)['""]? # $2: href
[^>]* # Consume up to end of open tag
> # End start tag
(.*?) # $3: name
</a\s*> # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );
But it doesn’t match the links anymore. I included a sample string here.
Basically I am trying to match these:
<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>
"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
304663` is the **TopicId**
"How to Get a Travel Visa" is the **Title**
In the sample I posted, there are at least 3, I didn’t count the other ones.
Also I use RegexHero (online and free) to see my matching interactively before adding it to code.
For completeness, here how it’s done with the Html Agility Pack, which is a robust HTML parser for .Net (also available through NuGet, so installing it takes about 20 seconds).
Loading the document, parsing it, and finding the 3 links is as simple as:
That’s it, really. Now you can easily get the data: