I have a regex in C# that I’m using to match image tags and pull out the URL. My code works in most situations. The code below “fixes” all relative image URLs by converting them to absolute URLs.
The issue is that the regex will not match the following:
<img height="150" width="202" alt="" src="../Image%20Files/Koala.jpg" style="border: 0px solid black; float: right;">
For example, it matches this one just fine:
<img height="147" width="197" alt="" src="../Handlers/SignatureImage.ashx?cid=5" style="border: 0px solid black;">
Any ideas on how to make it match would be great. I think the issue is the % in the URL, but I could be wrong.
Regex rxImages = new Regex(" src=\"([^\"]*)\"", RegexOptions.IgnoreCase & RegexOptions.IgnorePatternWhitespace);
mc = rxImages.Matches(html);
if (mc.Count > 0)
{
    Match m = mc[0];
    string relitiveURL = html.Substring(m.Index + 6, m.Length - 7);
    if (relitiveURL.Substring(0, 4) != "http")
    {
        Uri absoluteUri = new Uri(baseUri, relitiveURL);
        ret += html.Substring(0, m.Index + 5);
        ret += absoluteUri.ToString();
        ret += html.Substring(m.Index + m.Length - 1, html.Length - (m.Index + m.Length - 1));
        ret = convertToAbsolute(URL, ret);
    }
}
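For what it's worth, a quick sketch suggests the pattern itself is not tripped up by the `%`: inside `[^\"]*` it is an ordinary character, so the Koala tag should match. Two other things in the snippet may be worth checking. `RegexOptions.IgnoreCase & RegexOptions.IgnorePatternWhitespace` is a bitwise AND of two different flags, which evaluates to `RegexOptions.None`. And `Uri.ToString()` returns an unescaped form, so `%20` comes back as a literal space, whereas `AbsoluteUri` keeps it escaped. The names `SrcRewriter` and `ToAbsolute` and the `example.com` base URL are made up for illustration, and the sketch uses `Regex.Replace` over every match instead of recursing on the first one, which is a swapped-in approach rather than the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class SrcRewriter
{
    // Same pattern as in the question. Only IgnoreCase is passed here:
    // (IgnoreCase & IgnorePatternWhitespace) is RegexOptions.None, and
    // IgnorePatternWhitespace would drop the leading space in the pattern.
    private static readonly Regex RxSrc =
        new Regex(" src=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    public static string ToAbsolute(string html, Uri baseUri)
    {
        // The evaluator runs once per match, so every src attribute is handled.
        return RxSrc.Replace(html, m =>
        {
            string url = m.Groups[1].Value;

            // Same check as the question: leave http/https URLs untouched.
            if (url.StartsWith("http", StringComparison.OrdinalIgnoreCase))
                return m.Value;

            // AbsoluteUri keeps %20 escaped; ToString() would unescape it.
            Uri absolute = new Uri(baseUri, url);
            return " src=\"" + absolute.AbsoluteUri + "\"";
        });
    }
}
```

Calling `SrcRewriter.ToAbsolute(html, new Uri("http://example.com/app/page.aspx"))` on the Koala tag above should leave every other attribute alone and rewrite only the `src` value.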
Using a regex to pull image URLs out of HTML in this way is a bad idea. See here for a good demonstration of why.
You can use an HTML parser such as the HTML Agility Pack to parse the HTML and query it using XPath syntax.
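As a rough sketch of what that could look like, assuming the `HtmlAgilityPack` NuGet package is referenced: load the markup, select every `img` element that has a `src` attribute with XPath, and resolve each value against the base URI. The class name `ImageUrlFixer`, the method name `MakeImageUrlsAbsolute`, and the `example.com` URL in the usage note are hypothetical and only there for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of the original code.
public static class ImageUrlFixer
{
    public static string MakeImageUrlsAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
            return html;

        foreach (HtmlNode img in images)
        {
            string src = img.GetAttributeValue("src", string.Empty);

            // Resolves relative values against baseUri; values that are
            // already absolute come back unchanged. Unparseable values are skipped.
            if (Uri.TryCreate(baseUri, src, out Uri absolute))
                img.SetAttributeValue("src", absolute.AbsoluteUri);
        }

        // Serialize the modified document back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

With a base of `new Uri("http://example.com/app/page.aspx")`, a `src` of `../Image%20Files/Koala.jpg` should resolve to `http://example.com/Image%20Files/Koala.jpg`. Because the parser deals with attribute order, quoting, and whitespace, no index arithmetic on the match is needed.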