I have a helper method called HighlightKeywords, which I use on a forum's search results page to highlight the keyword(s) the user searched for within the posts.
The problem is that the method also finds matches inside HTML tags. Say the user searches for the keyword ‘hotmail’: HighlightKeywords finds each match and wraps it in a span tag that specifies a style to apply, but some of those matches are inside HTML anchor tags and, in some cases, image tags. As a result, when I render the highlighted posts to screen, those HTML tags are broken (because a span has been inserted inside them).
Here is my function:
public static string HighlightKeywords(this string s, string keywords, string cssClassName)
{
    if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(keywords))
    {
        return s;
    }
    // Split on spaces, skipping the empty entries that repeated spaces would produce
    string[] sKeywords = keywords.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (string sKeyword in sKeywords)
    {
        // Escape the keyword so that characters such as '+' or '(' are matched literally
        s = Regex.Replace(s, @"\b" + Regex.Escape(sKeyword) + @"\b", "<span class=\"" + cssClassName + "\">$0</span>", RegexOptions.IgnoreCase);
    }
    return s;
}
What would be the best way to prevent this from breaking? Even simply excluding any matches that occur within anchor tags (whether they are web or email addresses) or image tags would be enough.
No. You can’t do that with a regular expression alone, at least not in a way that won’t break. Regular expressions are not up to the task of parsing HTML. I am really sorry. You will want to read this rant too: RegEx match open tags except XHTML self-contained tags
So you will probably need to parse the HTML (I hear the HtmlAgilityPack is good) and then match only inside certain portions of the document, excluding anchor tags and the like.
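As a rough sketch of what that could look like (not a tested drop-in replacement): the code below assumes the HtmlAgilityPack NuGet package is installed. The class name HighlightExtensions and the method name HighlightKeywordsInText are made up for illustration. It parses the post, runs the keyword regex only over text nodes that are not inside an anchor, script or style element, and writes the document back out. Attribute values such as href, src and alt are not text nodes, so image tags are left alone as well.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class HighlightExtensions // hypothetical class name
{
    // Hypothetical parser-based variant of HighlightKeywords.
    public static string HighlightKeywordsInText(this string html, string keywords, string cssClassName)
    {
        if (string.IsNullOrEmpty(html) || string.IsNullOrWhiteSpace(keywords))
        {
            return html;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only text nodes that are not inside <a>, <script> or <style>
        var textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null) // SelectNodes returns null when nothing matches
        {
            return html;
        }

        // Build one alternation so that inserted span markup is never re-scanned
        string[] escaped = Array.ConvertAll(
            keywords.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
            Regex.Escape);
        var pattern = new Regex(@"\b(?:" + string.Join("|", escaped) + @")\b",
                                RegexOptions.IgnoreCase);
        string replacement = "<span class=\"" + cssClassName + "\">$0</span>";

        foreach (HtmlNode node in textNodes)
        {
            var textNode = node as HtmlTextNode;
            if (textNode != null)
            {
                // Text holds the raw markup of the node and is written back unencoded
                textNode.Text = pattern.Replace(textNode.Text, replacement);
            }
        }

        // WriteTo() serialises the modified tree back to a string
        return doc.DocumentNode.WriteTo();
    }
}
```

Calling post.HighlightKeywordsInText("hotmail", "highlight") on a post that contains both plain text and a mailto: link should then wrap only the plain-text occurrences and leave the anchor untouched. One caveat of this sketch: the regex still runs over the raw text of each node, so a keyword such as ‘amp’ could match inside an entity like &amp;.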