I am trying to scrape some information off a page using the jsdom.env function. However, the page returned in the env() callback says that access to the server is denied, instead of showing the content I see when I load the same URL in a browser.
Thus, there seems to be a difference between how the browser loads the page and how jsdom loads it. Is this something that can be configured in the jsdom module?
Edit:
Example URL: http://www.bestbuy.com/site/HP+-+20%22+Widescreen+Flat-Panel+LCD+Monitor/1422209.p?id=1218257754431&skuId=1422209
Update:
The issue was that jsdom does not specify the User-Agent HTTP header. See the detailed answer below.
The problem was that jsdom does not specify a ‘User-Agent’ HTTP header, which the bestbuy.com server checks for. If it is empty, access is denied. Currently, there is no way of specifying this through jsdom (see https://github.com/tmpvar/jsdom/issues/196).
A workaround that worked for me was to use the request module to get the page content and then pass it on to jsdom to work on. The request module allows you to specify a user agent.
Example:
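A minimal sketch of that workaround, assuming the older jsdom API that still provides jsdom.env and the request package from npm. The helper names buildRequestOptions and scrapePage, and the User-Agent string itself, are made up for illustration; any non-empty, browser-like value may be enough for the server.

```javascript
// Hypothetical helper: builds the options object passed to the request module,
// with an explicit User-Agent header so the server does not see an empty one.
function buildRequestOptions(url, userAgent) {
  return {
    url: url,
    headers: {
      // Placeholder default (an assumption, not a required value).
      'User-Agent': userAgent || 'Mozilla/5.0 (compatible; my-scraper/0.1)'
    }
  };
}

// Hypothetical helper: fetches the page with request, then hands the HTML to jsdom.
function scrapePage(url, callback) {
  // Required lazily so buildRequestOptions can be used on its own.
  var request = require('request');
  var jsdom = require('jsdom');

  request(buildRequestOptions(url), function (error, response, body) {
    if (error) {
      return callback(error);
    }
    // Pass the already-fetched HTML to jsdom instead of letting it fetch the URL.
    jsdom.env({
      html: body,
      done: function (errors, window) {
        callback(errors, window);
      }
    });
  });
}
```

Calling scrapePage(url, function (err, window) { ... }) should then give you a window whose document can be queried as usual, for example window.document.title.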