I'm using HtmlAgilityPack to extract several HTML tags. Here's what I do:
HtmlDoc = new HtmlDocument();
StringReader sr = new StringReader(decodedHTML);
HtmlDoc.Load(sr);
sr.Close();
var anchor_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_ANCHOR + "[@" + HTML.ATTRIBUT_HREF + "]");
var embed_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_EMBED + "[@" + HTML.TAG_EMBED_SRC + "]");
var iframe_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IFRAME + "[@" + HTML.TAG_IFRAME_SRC + "]");
var img_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IMG + "[@" + HTML.TAG_IMG_SRC + "]");
var audio_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_AUDIO); // may contain inner-html
var object_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_OBJECT); // may contain inner-html
var video_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_VIDEO); // may contain inner-html
Here decodedHTML is the HTML page as a string. After that I check whether the variables above are null:
if (anchor_tags != null)
{
    ExtractLinks_AnchorTags(anchor_tags);
}
if (audio_tags != null)
{
    ExtractLinks_AudioTags(audio_tags);
}
if (embed_tags != null)
{
    ExtractLinks_EmbedTags(embed_tags);
}
if (iframe_tags != null)
{
    ExtractLinks_iFrameTags(iframe_tags);
}
if (img_tags != null)
{
    ExtractLinks_ImgTags(img_tags);
}
if (object_tags != null)
{
    ExtractLinks_ObjectTags(object_tags);
}
if (video_tags != null)
{
    ExtractLinks_ObjectTags(video_tags);
}
Some of them are definitely null, because most of the ExtractLinks methods aren't even called. For example, when I visit youtube.com, there are several iframe tags and the code doesn't recognize them.
Edit:
When I delete the "[@" + HTML.TAG_IFRAME_SRC + "]" part, the iframes are recognized, but I only want to extract the iframes that have a src attribute. What's the correct XPath syntax for that?
HtmlAgilityPack does not load the contents of iframe elements. In order to inspect the content of an iframe, read the src attribute (which represents the iframe's URI) and perform a separate web request to load that into a separate HtmlDocument.

Along the way, be aware of these possible issues:
- The src attribute may contain a relative URI. For example, if you visit http://www.example.com and see that an iframe has src="/samplePage", you should first convert it to an absolute URI (in this case, http://www.example.com/samplePage).
- Some iframe elements may not have a src attribute, because it is added dynamically via JavaScript when the document is rendered in a browser. It is also possible to create entire iframe elements with JavaScript, elements that you wouldn't even see if you just do a regular HttpWebRequest. In cases like these, you have to analyze the JavaScript present on the page and duplicate that logic in your program.

Update
The XPath expression for iframe elements that have a src attribute is:
//iframe[@src]
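
The relative-to-absolute conversion from the first issue above could be sketched roughly as follows, using only System.Uri. The class name IframeUriHelper and the method ResolveSrc are hypothetical names made up for this illustration, and the sketch assumes you already know the URI of the page that the iframe was found on.

```csharp
using System;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class IframeUriHelper
{
    // Resolves a possibly relative src value against the URI of the page
    // that contained the iframe. Returns null when src is empty or cannot
    // be combined with pageUri into a valid URI.
    public static Uri ResolveSrc(Uri pageUri, string src)
    {
        if (string.IsNullOrWhiteSpace(src))
        {
            return null;
        }

        Uri absolute;
        // An absolute src is returned as-is; a relative one is resolved
        // against pageUri.
        return Uri.TryCreate(pageUri, src, out absolute) ? absolute : null;
    }
}
```

With the example from above, IframeUriHelper.ResolveSrc(new Uri("http://www.example.com"), "/samplePage") should yield http://www.example.com/samplePage, which you can then request and load into a separate HtmlDocument.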