Are there websites that can tell a script is accessing them, even after the User-Agent header has been changed? I assume that is what happens with code like this, which gives an error:
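import urllib2

url = 'http://example.com/'  # placeholder; use the page you want to fetch
req_headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=req_headers)
# Request objects have no open() method; pass the request to urlopen() instead
html = urllib2.urlopen(req).read()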
If yes, then how?
Yes. For starters, look at the complete set of headers your browser sends, using a tool like Firebug or the browser's built-in developer tools. You’ll notice that normal browsers provide a lot of information, such as the languages they accept, that urllib does not provide. So a website might check for the presence of these other headers.
Another trick would be to include a 1×1 pixel image on a page and check whether the client requested the image file. If not, then the client is either a text-only browser (like lynx) or a script. I think JavaScript can also be used to look for the presence of a mouse, for example by listening for mouse movement events that a script would never generate.
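To get past the header check described above, a script can send a fuller, browser-like set of headers. Here is a minimal sketch, assuming Python 3's urllib.request (the successor to urllib2). The header values and the names BROWSER_LIKE_HEADERS, build_request and fetch_html are made up for illustration; copy real values from your own browser's requests.

```python
import urllib.request

# Illustrative values only (an assumption, not taken from any particular site);
# copy the real ones from your browser's developer tools.
# Accept-Encoding is left out on purpose: urllib does not decompress
# gzip responses automatically.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, extra_headers=None):
    """Return a Request carrying browser-like headers plus any extras."""
    headers = dict(BROWSER_LIKE_HEADERS)
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)


def fetch_html(url):
    """Fetch the page body as bytes using the browser-like headers."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Calling fetch_html('http://example.com/') would then send those headers with the request. This only covers header checks; it does nothing about the image or JavaScript tricks.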
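Generally, it’s a game of cat and mouse. One alternative to urllib is Selenium. Selenium launches and drives a real browser window, so the site sees ordinary browser traffic, including image requests and JavaScript execution.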
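A rough sketch of the Selenium route, assuming the selenium package is installed and Firefox with its driver is available on the machine. The function name fetch_with_browser is made up for illustration.

```python
def fetch_with_browser(url):
    """Load the page in a real Firefox window and return the rendered HTML."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumes Firefox and its driver are set up
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

This is much slower than urllib because a whole browser starts up for each call, so it is usually worth reusing one driver when fetching many pages.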