I have some website source stream I am trying to parse. My current Regex

Question

0

Asked: May 25, 20262026-05-25T18:52:10+00:00 2026-05-25T18:52:10+00:00

I have some website source stream I am trying to parse. My current Regex

0

I have some website source stream I am trying to parse. My current Regex is this:

Regex pattern = new Regex (
@"<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $1: id
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $2: href
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );

But it doesn’t match the links anymore. I included a sample string here.

Basically I am trying to match these:

<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>

"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
304663` is the **TopicId**
"How to Get a Travel Visa" is the **Title**

In the sample I posted, there are at least 3, I didn’t count the other ones.

Also I use RegexHero (online and free) to see my matching interactively before adding it to code.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-25T18:52:11+00:00

For completeness, here how it’s done with the Html Agility Pack, which is a robust HTML parser for .Net (also available through NuGet, so installing it takes about 20 seconds).

Loading the document, parsing it, and finding the 3 links is as simple as:

string linkIdPrefix = "thread_title_";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/upixof");
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a")
                              .Where(link => link.Id.StartsWith(linkIdPrefix));

That’s it, really. Now you can easily get the data:

foreach (var link in threadLinks)
{
    string href = link.GetAttributeValue("href", null);
    string id = link.Id.Substring(linkIdPrefix.Length); // remove "thread_title_"
    string text = link.InnerHtml; // or link.InnerText
    Console.WriteLine("{0} - {1}", id, href);
}

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have some website source stream I am trying to parse. My current Regex

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply