I have just begun scraping basic text off web pages, and am currently using


Asked: May 17, 2026

I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with box scores from rivals.yahoo.com (sports is my thing, so why not scrape something interesting?), but I am stuck on the NHL’s game summary pages. I think this is an interesting problem, so I thought I would post it here.

The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM

At first glance, it looks like basic text with no Ajax or anything else that would trip up a basic scraper. Then I realized I couldn’t right-click because of some JavaScript, so I worked around that. I right-clicked in Firefox, asked XPather for the XPath of the home team, and got:

/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td

When I try to grab that node or its inner text, HtmlAgilityPack can’t find it. Does anyone see anything strange in the page’s source code that might be stopping me?

I am new to this and still learning how sites might stop me from scraping, so any tips or tricks are greatly appreciated!

P.S. I observe all site rules regarding bots, etc., but I noticed this strange behavior and saw it as a challenge.

1 Answer

Editorial Team · Answer 1 · 2026-05-17T19:13:48+00:00

Unless my XPath knowledge is badly flawed (which is possible), I think the problem is the /tbody steps in your XPath expression.

When I do

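// Load a locally saved copy of the page (the path is wherever you saved it).
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (StreamReader sr = new StreamReader(@"C:\gs.htm"))
{
    doc.Load(sr);
}
// Same target cell as in the question, but without any tbody steps.
string xpath = @"//table[@id='Home']/tr[3]/td";
HtmlAgilityPack.HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
// SelectSingleNode returns null when nothing matches, so guard before reading InnerText.
string test = node != null ? node.InnerText : string.Empty;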

That works fine and returns
“COLUMBUS BLUE JACKETSGame 5 Home Game 3”
which I hope is the string you wanted.

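Examining the HTML, I couldn’t find a tbody element anywhere. Firefox inserts tbody elements into its DOM when it parses a table, so XPather reports them, but HtmlAgilityPack works from the raw source, where they don’t exist.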
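
If you want to keep pasting XPaths straight out of a browser tool, one option is to strip the tbody steps before handing the expression to HtmlAgilityPack. The helper below is only a rough sketch: the class name XPathCleanup and the method StripTbody are made up for illustration, and it assumes the page source really contains no tbody elements (if it does, stripping them will break the path instead of fixing it).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of HtmlAgilityPack): removes browser-inserted
// tbody steps from an XPath expression copied out of a tool such as XPather.
public static class XPathCleanup
{
    // Matches a "/tbody" step, with an optional predicate like [1],
    // but only when it is a whole step (followed by "/" or end of string).
    private static readonly Regex TbodyStep =
        new Regex(@"/tbody(\[[^\]]*\])?(?=/|$)");

    public static string StripTbody(string xpath)
    {
        return TbodyStep.Replace(xpath, string.Empty);
    }
}
```

For example, XPathCleanup.StripTbody("//table[@id='Home']/tbody/tr[3]/td") gives back "//table[@id='Home']/tr[3]/td", which is the expression used above. A short path anchored on an id is probably still the safer choice, since the long absolute path also depends on how each parser handles the html and body elements.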
