I am trying to extract a specific tag from HTML. (I know from reading on this site that you should not try to parse HTML with regular expressions, but I only need specific tags that follow a fairly fixed format.)
This is the regular expression, tested in Expresso, where it works exactly as it should:
(?<ExternalSource2>\<eds2[\s.]+url\=\"?(?<Url>[\w\./:\?=&\+%\d_-]+)\"?[\s.]*\>(?<Text>[\s.]*[\w\s\d]*)\</eds2\>)
The problem comes when I try to use it in C#. This code:
Regex re = new Regex(@"(?<ExternalSource2>\<eds2[\s.]+url\=\""?(?<Url>[\w\./:\?=&\+%\d_-]+)\""?[\s.]*\>(?<Text>[\s.]*[\w\s\d]*)\</eds2\>)");
string Input = @"width: 662px; height: 60px; vertical-align: middle""><eds2 url=""http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147"">PlaceLogo</eds2></td></tr></tbody></table><table style=""width: 662px; border-collapse: collapse""><tbod";
foreach (Match m in re.Matches(Input)) {
    HttpContext.Current.Response.Write(string.Format("Match : {0}<br/>", m));
    short i = 0;
    foreach (Group g in m.Groups) {
        HttpContext.Current.Response.Write(string.Format("Group {0} = {1}<br/>", i++, g.Value));
    }
    HttpContext.Current.Response.Write("<br/><br/>");
}
Produces this result :
Match : PlaceLogo
Group 0 = PlaceLogo
Group 1 = PlaceLogo
Group 2 = http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147
Group 3 = PlaceLogo
which is not at all what I expect.
With the pattern below, though, the result is closer to what I expect (but still not quite):
Regex re = new Regex(@"eds2[\s.]+url\=\""?(?<Url>[\w\./:\?=&\+%\d_-]+)\""?[\s.]*\>(?<Text>[\s.]*[\w\s\d]*)\</eds2\>");
Result :
Match : eds2 url="http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147">PlaceLogo
Group 0 = eds2 url="http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147">PlaceLogo
Group 1 = http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147
Group 2 = PlaceLogo
The expected output is :
Match : <eds2 url="http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147">PlaceLogo</eds2>
Group 0 = <eds2 url="http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147">PlaceLogo</eds2>
Group 1 = <eds2 url="http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147">PlaceLogo</eds2>
Group 2 = http://www.someurl.co.uk/_modules/system/Newsletter.aspx?Username=TBO&Password=N5TBO2&TagID=PlaceLogo&TownID=147
Group 3 = PlaceLogo
Any help appreciated.
I can’t reproduce your problem with your sample code. It creates the following output:
Please clarify your question.
UPDATE:
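I guess your problem is the following:
You write the result of the match directly to the response stream without escaping it. This means the browser interprets it as HTML instead of displaying it as text, which is what you intended. The browser does not display the unknown <eds2> tag itself, so only its content, PlaceLogo, is visible.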
You should change your code to this:
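A minimal sketch of such a change, assuming the fix is to HTML-encode every value before it is written out. The class `EncodedMatchDemo` and the method `Describe` are hypothetical names used only for illustration, and `System.Net.WebUtility.HtmlEncode` is used here so the sketch runs outside ASP.NET; inside the page from the question, `HttpUtility.HtmlEncode` should serve the same purpose.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class EncodedMatchDemo
{
    // The pattern from the question, unchanged.
    private static readonly Regex Eds2 = new Regex(
        @"(?<ExternalSource2>\<eds2[\s.]+url\=\""?(?<Url>[\w\./:\?=&\+%\d_-]+)\""?[\s.]*\>(?<Text>[\s.]*[\w\s\d]*)\</eds2\>)");

    // Hypothetical helper: builds the same report as the loop in the question,
    // but HTML-encodes each value so a browser shows the tag as literal text.
    public static string Describe(string input)
    {
        var sb = new StringBuilder();
        foreach (Match m in Eds2.Matches(input))
        {
            // Encoding turns < > " & into &lt; &gt; &quot; &amp;
            sb.AppendFormat("Match : {0}<br/>", WebUtility.HtmlEncode(m.Value));
            for (int i = 0; i < m.Groups.Count; i++)
            {
                sb.AppendFormat("Group {0} = {1}<br/>", i,
                    WebUtility.HtmlEncode(m.Groups[i].Value));
            }
            sb.Append("<br/><br/>");
        }
        return sb.ToString();
    }
}
```

In the page from the question, the returned string could be passed to `HttpContext.Current.Response.Write` in place of the unencoded writes. Only the matched values are encoded; the `<br/>` separators are left as real markup so the line breaks still render.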