I am using the Html Agility Pack and did something like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://test.com");
int count = doc.DocumentNode.SelectNodes("//img").Count();
I get 38 back.
When I go to that page and run $('img').size(); in the browser console, I get 43 back. Why is there a difference? Is "//img" only looking for images at the root level?
Is that why I might be missing some?
No, it looks for descendant nodes (children, grandchildren, etc.) of the current node. Your XPath expression selects all the images in the document.
My assumption: some of the images are created dynamically via JavaScript. HtmlAgilityPack does not execute scripts, so it cannot see those images.
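To illustrate that assumption, here is a minimal sketch that parses a made-up page from a string instead of a live URL. The markup, the ImageCounter class, and the CountImages helper are hypothetical and only serve the example. The page contains two static img tags plus a script that would append a third image in a browser; since HtmlAgilityPack only parses the markup it is given, it should report just the two static tags.

```csharp
using System;
using HtmlAgilityPack;

public static class ImageCounter
{
    // Counts the <img> elements that exist in the markup as parsed,
    // without running any scripts (hypothetical helper for this example).
    public static int CountImages(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // By default SelectNodes returns null, not an empty collection,
        // when nothing matches, so guard against that.
        var nodes = doc.DocumentNode.SelectNodes("//img");
        return nodes?.Count ?? 0;
    }

    public static void Main()
    {
        // Made-up page: two static images, one nested inside a div,
        // and a script that a browser would run to add a third image.
        string html = @"<html><body>
<img src='a.png'>
<div><img src='b.png'></div>
<script>
  var extra = document.createElement('img');
  extra.src = 'c.png';
  document.body.appendChild(extra);
</script>
</body></html>";

        Console.WriteLine(CountImages(html));
    }
}
```

The nested image is counted too, which shows that "//img" is not limited to root-level nodes. A browser running the same page would report one more image than this count because it executes the script.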
By the way, for http://test.com I got 87 image nodes with AgilityPack (doc.DocumentNode.SelectNodes("//img").Count()), and 87 image nodes from the Chrome console ($('img').size()).
EDIT:
The HtmlWeb.Load() method internally uses the WebRequest class to get data. The role of AgilityPack is to parse the data correctly. It's completely possible that some web resources return different content for the same URI depending on request headers such as User-Agent and others. For example, the User-Agent header can be set via the HtmlWeb.UserAgent property.
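As a rough sketch of that idea, the snippet below sets HtmlWeb.UserAgent before loading the page, so the server sees a browser-like request. The BrowserLikeFetch class, the CreateWeb helper, and the User-Agent string are assumptions made for the example; substitute the string your own browser sends. Whether the count changes depends entirely on how the site responds to that header.

```csharp
using System;
using HtmlAgilityPack;

public static class BrowserLikeFetch
{
    // Hypothetical browser-like User-Agent value; replace it with the one
    // your browser actually sends (visible in its developer tools).
    public const string BrowserUserAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Builds an HtmlWeb instance that sends the User-Agent above.
    public static HtmlWeb CreateWeb()
    {
        return new HtmlWeb { UserAgent = BrowserUserAgent };
    }

    public static void Main()
    {
        HtmlWeb web = CreateWeb();
        HtmlDocument doc = web.Load("http://test.com");

        // SelectNodes returns null by default when there are no matches.
        var images = doc.DocumentNode.SelectNodes("//img");
        Console.WriteLine(images?.Count ?? 0);
    }
}
```

If the count still differs from the browser after this change, the remaining images are most likely added by script after the page loads.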