I was working on a simple application to pull some currency conversions from a website when I received an error message (below) stating that they have a policy against automated extraction.
Autoextraction Prohibited
Automated extraction of our content is prohibited. See http://www.xe.com/errors/noautoextract.htm.
I don’t really have an intention of breaking their policy but I am curious as to how they can tell. Can anyone enlighten me?
1) User-Agent
2) Introducing a JavaScript pop-up, something like "Click OK to enter."
3) Calculating the number of requests per hour from a particular IP address, if you are not behind NAT.
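To make the User-Agent and request-rate checks a bit more concrete, here is a rough server-side sketch in Python. It is not how xe.com (or any particular site) does it; the token list, the limit of 100 requests per hour, and the names `looks_like_script` and `RequestCounter` are all made up for illustration.

```python
import time
from collections import defaultdict, deque

# Hypothetical values, chosen only for illustration.
SUSPICIOUS_AGENT_TOKENS = ("python-urllib", "python-requests", "curl", "wget")
MAX_REQUESTS_PER_HOUR = 100
WINDOW_SECONDS = 3600


def looks_like_script(user_agent):
    """Return True if the User-Agent is missing or names a common HTTP library."""
    if not user_agent:
        return True
    agent = user_agent.lower()
    return any(token in agent for token in SUSPICIOUS_AGENT_TOKENS)


class RequestCounter:
    """Counts requests per IP address over a sliding time window."""

    def __init__(self, limit=MAX_REQUESTS_PER_HOUR, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def over_limit(self, ip, now=None):
        """Record one request from ip and report whether it exceeds the limit."""
        now = time.time() if now is None else now
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        return len(recent) > self.limit
```

A script that keeps the default `Python-urllib/3.x` agent string, or that sends hundreds of requests an hour from one address, would trip either check. Many users behind one NAT share an IP address, which is why the per-IP count is only a heuristic.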
For more detail, take a look at this PyCon talk, web-strategies-for-programming-websites-that-don-t-expected-it, by Asheesh Laroia.
Also take a look at A Standard for Robot Exclusion.
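If you want your own script to respect a site's robot exclusion rules, Python's standard library can parse a robots.txt file for you. A minimal sketch, assuming a made-up robots.txt and a hypothetical helper named `allowed`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Against a live site you would normally call `set_url()` and `read()` on the parser instead of passing the text in yourself.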
Some websites also use
4) CAPTCHAs and reCAPTCHAs
5) Redirection, which means you need to add an HTTP Referer header to get your data.
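For the Referer point, this is roughly what setting the header looks like with Python's `urllib.request`, assuming the site's terms allow your access in the first place. The helper name `build_request` is hypothetical, and the function only builds the request without sending it.

```python
from urllib.request import Request


def build_request(url, referer, user_agent):
    """Build (but do not send) a request carrying Referer and User-Agent headers."""
    return Request(url, headers={"Referer": referer, "User-Agent": user_agent})
```

Passing the returned object to `urllib.request.urlopen()` would actually send it. Note that the header name is spelled "Referer" in HTTP, with a single "r" in the middle.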