Task
I’m supposed to create an application that extracts the name of an item from an Amazon.com webpage.
Action
I thought I would use the Html Agility Pack to get this done, and I think I’ve got a solution going, but one bug keeps persisting.
Result
Let’s say I try to pull the HTML source from n different sites: the application always uses the source of the first site for sites 1 through n, and I’m not sure why. I can extract HTML from a different website only if I restart my computer.
Code
private void extractHTML()
{
    // retrieve URL
    string address = txtURL.Text;
    string itemId = "result_0";

    // create document
    HtmlWeb webGet = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument document = webGet.Load(address);

    // look for name of result
    HtmlNode node = document.GetElementbyId(itemId);
    if (node != null)
    {
        IEnumerable<HtmlNode> allH3 = node.Descendants("h3");
        foreach (HtmlNode h3 in allH3)
        {
            // skip h3 elements without a second child node (avoids an out-of-range index)
            if (h3.ChildNodes.Count < 2 || h3.ChildNodes[1].InnerHtml == null)
            {
                continue;
            }

            lblId.Text = itemId;

            // dig down to lowest subnode to get correct InnerHtml
            HtmlNode subNode = h3.ChildNodes[1];
            if (subNode.ChildNodes.Count > 0)
            {
                lblName.Text = subNode.ChildNodes[subNode.ChildNodes.Count - 1].InnerHtml;
            }
            else
            {
                lblName.Text = subNode.InnerHtml;
            }
            break;
        }
    }
}
Help is much appreciated! Thanks in advance.
If, as stated in the comments, you’re targeting a page such as http://www.amazon.com/s/ref=nb_sb_ss_i_0_5?url=search-alias%3Daps&field-keywords=radio&sprefix=radio%2Caps%2C182 to try to get all item names, then the following code:
will output this:
The XPath expression simply selects all SPAN elements whose CLASS attribute is set to ‘lrg bold’. To find it, I looked at the saved version of the HTML and picked a good discriminant for the item names.
I suggest you learn a bit of XPath, as it’s very powerful. A good tutorial is here: XPath Tutorial
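As a minimal sketch of that kind of selection, assuming the Html Agility Pack is referenced and the item names still sit in SPAN elements with that exact class value, something along these lines should work. The inline HTML string and the item name in it are hypothetical stand-ins for the saved page, not real Amazon markup.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical HTML standing in for a saved search-results page.
        string html =
            "<div id='result_0'><h3><a>" +
            "<span class='lrg bold'>Example Radio</span>" +
            "</a></h3></div>";

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // Select every SPAN whose class attribute is exactly 'lrg bold'.
        HtmlNodeCollection spans =
            doc.DocumentNode.SelectNodes("//span[@class='lrg bold']");

        // SelectNodes may return null when nothing matches, so check first.
        if (spans == null)
        {
            Console.WriteLine("No item names found.");
            return;
        }

        foreach (HtmlNode span in spans)
        {
            // InnerText drops nested tags; DeEntitize decodes entities such as &amp;.
            Console.WriteLine(HtmlEntity.DeEntitize(span.InnerText.Trim()));
        }
    }
}
```

Note that the predicate compares the whole attribute value, so a SPAN with class ‘bold lrg’ or with an extra class would not match. If the markup varies, a contains() test on @class may be a more tolerant choice.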