I have an HTML file that isn’t syntactically correct, and I’m parsing it with HTML Agility Pack.
But if I have a link like
<a href="http://google.com/!/!!!">Google</a>
it’s a problem. Is there a way to detect broken links, so that when an error is found (no page is available at that link) the application stores the link in a list and returns it?
The same problem applies to tags, for example:
<img hhh="jjj"/>
Here the image tag is all wrong; this should go in the ‘errors for repair’ list too.
Thanks in advance.
You need to loop through Document.DocumentNode.Descendants("a") and check whether the href attribute is bad.
Similarly, you can loop through Document.DocumentNode.Descendants("img") and check for src attributes.
EDIT:
To check for bad attributes, you can maintain a Dictionary<string, IEnumerable<string>> that maps tag names to valid attributes, then use LINQ to find missing attributes.
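A minimal sketch of both checks, under some assumptions: the class name HtmlChecker, the method names, the attribute whitelist, and the required-attribute table are all hypothetical and only illustrative (they are not a complete list of valid HTML attributes). The link check sends a HEAD request with HttpClient and treats any non-success status, network failure, or timeout as "broken"; some servers reject HEAD, so a GET fallback may be needed in practice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class HtmlChecker
{
    // Assumed whitelist: tag name -> attributes treated as valid (illustrative only).
    static readonly Dictionary<string, IEnumerable<string>> ValidAttributes =
        new Dictionary<string, IEnumerable<string>>
        {
            { "a",   new[] { "href", "title", "target" } },
            { "img", new[] { "src", "alt", "width", "height" } },
        };

    // Assumed rule: each listed tag must carry this one attribute.
    static readonly Dictionary<string, string> RequiredAttribute =
        new Dictionary<string, string> { { "a", "href" }, { "img", "src" } };

    // Returns the markup of every listed tag that has an attribute outside
    // the whitelist or lacks its required attribute.
    public static List<string> FindBadTags(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants()
            .Where(n => ValidAttributes.ContainsKey(n.Name))
            .Where(n => n.Attributes.Any(a => !ValidAttributes[n.Name].Contains(a.Name))
                     || !n.Attributes.Any(a => a.Name == RequiredAttribute[n.Name]))
            .Select(n => n.OuterHtml)
            .ToList();
    }

    // Returns every href that is not an absolute http(s) URL or that does not
    // answer a HEAD request with a success status code.
    public static async Task<List<string>> FindBrokenLinksAsync(HtmlDocument doc, HttpClient client)
    {
        var broken = new List<string>();
        foreach (var link in doc.DocumentNode.Descendants("a"))
        {
            string href = link.GetAttributeValue("href", "");
            // Simplification: relative links are reported as broken too.
            if (!Uri.TryCreate(href, UriKind.Absolute, out Uri uri)
                || (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
            {
                broken.Add(href);
                continue;
            }
            try
            {
                using (var request = new HttpRequestMessage(HttpMethod.Head, uri))
                using (var response = await client.SendAsync(request))
                {
                    if (!response.IsSuccessStatusCode) broken.Add(href);
                }
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                broken.Add(href); // DNS failure, refused connection, or timeout
            }
        }
        return broken;
    }
}
```

With a document loaded through doc.LoadHtml(...), FindBadTags should flag the img tag from the question, because hhh is not in the whitelist and src is missing. The results of both methods can be merged into a single ‘errors for repair’ list.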