I am trying to pull a specific URL in Python using raw_html = urlopen(url).read().
When I inspect ‘raw_html’, I find that the expected HTML/text has been replaced with some text that essentially tells me that I cannot crawl the site.
However, when I pull the same URL using ‘curl -O’ from UNIX/python, the page is downloaded just fine.
What is the reason for the discrepancy, and what method should I use within Python so that I can get the HTML as I do with the curl command in Unix?
Thanks in advance for any thoughts!
When an HTTP client makes a request, it identifies itself to the server, typically through the User-Agent request header. In this case, the server checks whether the client is a bot, and if it is, it refuses access (though apparently it fails to detect curl).
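To see how Python identifies itself by default, here is a small sketch for Python 3's urllib.request (the question's urlopen may come from Python 2, where the module layout differs). The helper name default_user_agent is made up for illustration.

```python
from urllib.request import build_opener


def default_user_agent():
    """Return the User-Agent value urllib attaches when none is set explicitly.

    Hypothetical helper for illustration; no network request is made.
    """
    opener = build_opener()
    # opener.addheaders is a list of (name, value) pairs sent with every request.
    return dict(opener.addheaders)["User-agent"]


print(default_user_agent())
```

The printed value starts with "Python-urllib/", which is an easy string for a server to match and reject.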
You can get around this by setting the user-agent string to spoof a browser. See this question for how to do that with urllib. However, if the server’s owner does not want you to crawl it, and it detects that you’re doing so anyway (because you’re requesting pages at too high a rate), you might find yourself blocked from accessing the site, so contacting the owner might be a better idea than spoofing.
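A minimal sketch of the user-agent approach, assuming Python 3's urllib.request. The function names build_request and fetch_html and the BROWSER_UA string are illustrative assumptions, not something the site is known to accept.

```python
from urllib.request import Request, urlopen

# Assumed browser-like identifier; any value the server accepts would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=BROWSER_UA):
    """Download the page body as bytes, like urlopen(url).read()."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Calling raw_html = fetch_html(url) then replaces the original urlopen(url).read() line. If you fetch many pages, pausing between requests (for example with time.sleep) lowers the chance of being blocked for requesting too quickly.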