I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I’m working on.
I could try using regular expressions to grab just these elements but there are several issues with that approach:
- I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
- SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
- I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
- Not to mention how HTML is not a regular language
I did some research for HTML parsers in .NET and many SO posts and blogs recommend the HTML Agility Pack. I’ve never used it before and I don’t know if it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it – no HTML or BODY.) I know I could read the documentation but it’d save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)
Absolutely, that is what it excels at.
In fact, many web pages you’ll find in the wild could be described as HTML fragments, due to missing <html> tags or improperly closed tags.
The HtmlAgilityPack simulates what the browser has to do: try to make sense of what is sometimes a jumble of mismatched tags. It is an imperfect science, but HtmlAgilityPack does it very well.
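As a rough illustration of the fragment case from the question, here is a minimal sketch that loads a HEAD-only fragment, updates its LINK and META elements, and serializes the result. The fragment text, attribute values, and class name are made up for the example; only the HtmlAgilityPack calls (LoadHtml, SelectNodes, GetAttributeValue, SetAttributeValue, OuterHtml) come from the library.

```csharp
using System;
using HtmlAgilityPack;

class HeadFragmentDemo
{
    static void Main()
    {
        // Hypothetical user-control content: a HEAD fragment with no HTML or BODY,
        // and META/LINK elements that are never explicitly closed.
        string fragment =
            "<head>\n" +
            "  <meta name=\"description\" content=\"old description\">\n" +
            "  <link rel=\"stylesheet\" href=\"/css/old.css\">\n" +
            "</head>";

        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//link | //meta");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                if (node.Name == "link")
                {
                    // Read the current value, then write back an updated one.
                    string href = node.GetAttributeValue("href", "");
                    node.SetAttributeValue("href", href.Replace("old.css", "new.css"));
                }
                else
                {
                    node.SetAttributeValue("content", "new description");
                }
            }
        }

        // Serialize the modified fragment; doc.Save(path) would write it to a file instead.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Two caveats worth testing against your own files before relying on this. Anything inside an IE conditional comment is part of a comment node, so the XPath query above will not match META or LINK elements there; you would likely need to select the comments separately (for example with `//comment()`) and handle their text yourself. How server-side directives and `<asp:...>` tags survive a load-and-save round trip is also something I would verify on a few real pages and user controls first.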