Some other websites use cURL and a fake HTTP referer to copy my website's content.
Is there any way to detect cURL, or a client that is not a real web browser?
There is no magic solution to prevent automatic crawling. Everything a human can do, a robot can do too. There are only ways to make the job harder, so hard that only highly skilled geeks will try to get past them.
I had the same problem some years ago, and my first piece of advice is: if you have time, be a crawler yourself (by “crawler” I mean the person who crawls your website). It is the best school for the subject. By crawling several websites, I learned different kinds of protections, and combining them proved effective.
Here are some examples of protections you may try.
Sessions per IP
If a user opens 50 new sessions each minute, you can assume this user is a crawler that does not handle cookies. Of course, curl manages cookies perfectly, but if you couple this check with a visit counter per session (explained later), or if the crawler's author is a novice with cookies, it can be effective.
It is hard to imagine 50 people on the same shared connection reaching your website simultaneously (this of course depends on your traffic; that is up to you). And if it does happen, you can lock the pages of your website until a captcha is filled in.
Idea :
1) Create 2 tables: one to store banned IPs and one to store IPs and their sessions.
2) At the beginning of your script, delete entries that are too old from both tables.
3) Next, check whether the user's IP is banned (if so, set a flag to true).
4) If not, count how many sessions exist for his IP.
5) If he has too many sessions, insert his IP into the banned table and set the flag.
6) Insert his IP into the sessions-per-IP table if it has not already been inserted.
I wrote a code sample to illustrate the idea.
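The original sample did not survive in this copy of the answer, so what follows is only a hypothetical stand-in, written in Python with SQLite rather than the PHP the answer targets. It walks through steps 1 to 6 above. The table names, the function names and the three limits are assumptions chosen for illustration, not values from the original.

```python
import sqlite3

MAX_SESSIONS = 50     # hypothetical: sessions tolerated per IP inside the window
WINDOW_SECONDS = 60   # hypothetical: how long a session row is remembered
BAN_SECONDS = 600     # hypothetical: how long a ban lasts


def init_db(conn):
    # Step 1: one table for banned IPs, one for (IP, session) pairs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS banned_ips (ip TEXT PRIMARY KEY, created REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ip_sessions ("
        "ip TEXT, session_id TEXT, created REAL, PRIMARY KEY (ip, session_id))"
    )


def is_banned(conn, ip, session_id, now):
    """Return True if this IP is (or has just become) banned."""
    with conn:  # commits on success, rolls back on error
        # Step 2: purge entries that are too old from both tables.
        conn.execute("DELETE FROM banned_ips WHERE created < ?", (now - BAN_SECONDS,))
        conn.execute("DELETE FROM ip_sessions WHERE created < ?", (now - WINDOW_SECONDS,))
        # Step 3: already banned?
        if conn.execute("SELECT 1 FROM banned_ips WHERE ip = ?", (ip,)).fetchone():
            return True
        # Step 4: count the sessions currently recorded for this IP.
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM ip_sessions WHERE ip = ?", (ip,)
        ).fetchone()
        # Step 5: too many sessions, so ban the IP.
        if count >= MAX_SESSIONS:
            conn.execute("INSERT OR REPLACE INTO banned_ips VALUES (?, ?)", (ip, now))
            return True
        # Step 6: remember this (IP, session) pair if it is new.
        conn.execute(
            "INSERT OR IGNORE INTO ip_sessions VALUES (?, ?, ?)", (ip, session_id, now)
        )
        return False
```

In a real page you would call `is_banned` at the top of each request with the client IP, the session identifier and the current time, and show a captcha or an error page when it returns True.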
Visit Counter
If your user uses the same cookie to crawl your pages, you can use his session to block him. The idea is quite simple: is it plausible that your user visits 60 pages in 60 seconds?
Idea :
Sample code :
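The original idea list and sample are missing from this copy. As a rough stand-in, here is a hypothetical Python sketch of a per-session visit counter. It assumes the session is a plain dictionary-like store (as PHP's `$_SESSION` would be) and uses the 60 pages in 60 seconds figure from the paragraph above; the function name and the `visits` key are invented for illustration.

```python
MAX_PAGES = 60        # from the text: 60 pages ...
WINDOW_SECONDS = 60   # ... in 60 seconds


def too_many_visits(session, now):
    """Record one page view and report whether the session looks automated."""
    # Keep only the timestamps that still fall inside the sliding window.
    recent = [t for t in session.get("visits", []) if t > now - WINDOW_SECONDS]
    recent.append(now)
    session["visits"] = recent
    return len(recent) >= MAX_PAGES
```

A sliding window of timestamps is only one possible design; a simple counter reset every minute would be cheaper but lets a crawler burst across the reset boundary.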
An image to download
When a crawler does his dirty work, he wants a large amount of data in the shortest possible time. That’s why crawlers don’t download the images on pages: it takes too much bandwidth and makes the crawling slower.
This idea (I think the most elegant and the easiest to implement) uses mod_rewrite to hide code behind an image file (.jpg, .png, …). This image should be present on each page you want to protect: it could be your website logo, but choose a small image, because it must not be cached.
Idea :
1/ Add those lines to your .htaccess
2/ Create your logo.php containing the security check
3/ Increment your no_logo_count on each page you want to protect, and check whether it has reached your limit.
Sample code :
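The .htaccess lines and the logo.php sample are not preserved in this copy. The hypothetical Python sketch below models only the counter logic of steps 2 and 3: every protected page increments `no_logo_count`, and the handler behind the image resets it. The limit, the function names and the response headers are assumptions for illustration.

```python
NO_LOGO_LIMIT = 10  # hypothetical: pages allowed without ever fetching the image


def page_view_without_logo(session):
    """Call on each protected page; True means the limit has been exceeded."""
    session["no_logo_count"] = session.get("no_logo_count", 0) + 1
    return session["no_logo_count"] > NO_LOGO_LIMIT


def logo_request(session):
    """Call from the script that serves the image (logo.php in the text)."""
    session["no_logo_count"] = 0
    # The image must not be cached, otherwise a real browser stops requesting it.
    return {"Content-Type": "image/png", "Cache-Control": "no-store"}
```

A real browser fetches the image on every page, so its counter keeps going back to zero; a client that skips images lets the counter climb until it crosses the limit.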
Cookie check
You can create cookies on the JavaScript side to check whether your users interpret JavaScript (a crawler using cURL does not, for example).
The idea is quite simple: it works much like the image check.
Code :
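The original code is missing here as well. One way this could look, sketched in Python under several assumptions: the server embeds a small inline script that writes a cookie, then counts the page views that arrive without it. Signing the cookie value with HMAC is an addition of this sketch, not something the answer describes, and `SECRET_KEY`, `COOKIE_NAME`, `NO_JS_LIMIT` and the function names are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret
COOKIE_NAME = "js_ok"      # hypothetical cookie name
NO_JS_LIMIT = 10           # hypothetical: pages allowed without the cookie


def expected_token(session_id):
    # Tie the cookie value to the session so it cannot be reused elsewhere.
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def js_snippet(session_id):
    """Inline script for each page; only a JavaScript-capable client runs it."""
    token = expected_token(session_id)
    return '<script>document.cookie = "%s=%s; path=/";</script>' % (COOKIE_NAME, token)


def page_view_js_check(session, session_id, cookies):
    """Call on each protected page; True means the limit has been exceeded."""
    token = cookies.get(COOKIE_NAME, "")
    if hmac.compare_digest(token.encode(), expected_token(session_id).encode()):
        session["no_js_count"] = 0
    else:
        session["no_js_count"] = session.get("no_js_count", 0) + 1
    return session["no_js_count"] > NO_JS_LIMIT
```

A determined crawler can still read the script and set the cookie by hand, so this only raises the cost; it is best combined with the other counters.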
Protection against proxies
A few words about the different kinds of proxies found on the web:
It is easy to find a proxy to connect to any website, but it is very hard to find highly anonymous proxies.
Some $_SERVER variables may contain specific keys if your user is behind a proxy (exhaustive list taken from this question):
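The list itself is not preserved in this copy. As an illustration only, the Python sketch below checks a handful of keys that proxies commonly add; this selection is an assumption of the sketch, not the author's list. A WSGI `environ` dictionary exposes request headers under the same `HTTP_*` names as PHP's `$_SERVER`, so the check carries over directly. The limits and function names are hypothetical.

```python
# Assumed, non-exhaustive set of keys that often reveal a proxy.
PROXY_HEADER_KEYS = (
    "HTTP_VIA",
    "HTTP_X_FORWARDED_FOR",
    "HTTP_FORWARDED",
    "HTTP_CLIENT_IP",
    "HTTP_PROXY_CONNECTION",
)


def proxy_keys_present(environ):
    """Return the proxy-related keys that are set and non-empty."""
    return [key for key in PROXY_HEADER_KEYS if environ.get(key)]


def page_limit(environ, normal_limit=60, proxy_limit=20):
    """Pick a stricter visit limit when the request appears to come via a proxy."""
    return proxy_limit if proxy_keys_present(environ) else normal_limit
```

These headers are supplied by the client side, so they can be forged, and a highly anonymous proxy sends none of them; their absence proves nothing.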
You may apply a different behavior (lower limits, etc.) in your anti-crawl protections if you detect one of those keys in your $_SERVER variable.
Conclusion
There are many ways to detect abuse on your website, so you will surely find a solution. But you need to know precisely how your website is used, so that your protections are not too aggressive toward your “normal” users.