I’m using Html Agility Pack to do some basic web scraping of Google search results. As a newbie to XPath, I made sure my path expression is correct (with the help of FirePath). However, the returned HtmlNodeCollection is always null.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = web.Load("http://www.google.com/search?num=10&q=Hello+World");
// get search result URLs
var items = htmlDoc.DocumentNode.SelectNodes("//div[@id='ires']/ol[@id='rso']/li/div[@class='vsc']/h3/a/@href");
foreach (HtmlNode node in items)
{
Console.WriteLine(node.Attributes);
}
Am I missing something? Can anyone please enlighten me?
Thanks in advance,
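HAP can only process the raw HTML that is returned from the URL; it will not run any JavaScript on the page. FirePath, by contrast, evaluates your expression against the page after the browser has run its scripts, which is why the same path matches there but not in HAP. You need to adjust your query accordingly.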
In the raw HTML, the ires div exists, but the rso element doesn’t get inserted until the JavaScript is run, hence you get no results. There are other transformations done here which you’ll have to adjust for as well. Here’s a fragment of the HTML:
A more appropriate XPath to use for this would be:
It’d be easier to find all li elements with the g class, as those correspond to all the results. You’ll want to filter for h3 elements with the r class, otherwise you’d include other results (such as image results).
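As a rough sketch of that approach, the snippet below selects the links under li elements with the g class and h3 elements with the r class, then reads each href through GetAttributeValue. The inline HTML string is a made-up stand-in for the raw markup (not the real Google response), and the exact XPath is only one expression consistent with the description above, so treat both as assumptions. It also checks for null, since SelectNodes returns null rather than an empty collection when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the raw HTML; the real page markup will differ.
        const string html = @"
<div id='ires'><ol>
  <li class='g'><h3 class='r'><a href='http://example.com/a'>Result A</a></h3></li>
  <li class='g'><h3><a href='http://example.com/images'>Image results</a></h3></li>
</ol></div>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Select the anchor elements themselves and read href afterwards,
        // instead of ending the path with /@href.
        var links = htmlDoc.DocumentNode.SelectNodes("//li[@class='g']/h3[@class='r']/a");

        // SelectNodes returns null when there are no matches.
        if (links == null)
        {
            Console.WriteLine("No matching nodes found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // The second argument is the default returned if href is missing.
            Console.WriteLine(link.GetAttributeValue("href", string.Empty));
        }
    }
}
```

Against a live page you would swap LoadHtml for the web.Load call from the question and compare the XPath with the HTML that HAP actually receives (htmlDoc.DocumentNode.OuterHtml), not with what the browser shows.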