Hi everyone!
I am wondering whether there is a simple way to block automatic content crawlers on a shared web host (LAMP, no root access).
For example, I have a large collection of JPG images, and someone has written an automatic program (in PHP or something else) to download all of my image data.
I was thinking of using JavaScript to decrypt the images on the client side, which would make it more difficult for a crawler to collect all the data. But I am not sure about the impact on browsers without JavaScript support, or how effective this would be at stopping such crawlers.
Of course, good search engine crawlers should still be allowed.
Apart from images, what about text, audio, or video content? How should I deal with them?
Unless your content is hidden behind some form of authentication, anyone who seriously tries will be able to get it. That said, you can take some measures to make it a little more difficult using your .htaccess file.
To prevent hotlinking (referencing your files from another site), you can add rules that block access to anything that ends with gif, jpg, js, or css and doesn’t have your site as the HTTP_REFERER.
You can also block access by user agent (full list here):
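
The hotlinking and user-agent rules described above might look roughly like the sketches below. Both assume your host has mod_rewrite enabled and allows it in .htaccess. The domain example.com and the bot names BadBot and EvilScraper are placeholders, not real identifiers, and letting requests with an empty Referer through is my assumption rather than a requirement.

```apache
# --- Hotlink protection (sketch; example.com is a placeholder for your own domain) ---
RewriteEngine On
# Let requests with no Referer through (direct visits, proxies that strip the header)
RewriteCond %{HTTP_REFERER} !^$
# Let requests through when the Referer is your own site
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com(/|$) [NC]
# Any other request for these file types gets 403 Forbidden
RewriteRule \.(gif|jpg|js|css)$ - [F,NC]
```

```apache
# --- Block by user agent (sketch; BadBot and EvilScraper are made-up names) ---
RewriteEngine On
# Match any request whose User-Agent starts with one of the listed names
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilScraper [NC]
# Refuse the request with 403 Forbidden
RewriteRule ^ - [F]
```

Only the listed user agents are refused, so well-behaved search engine crawlers are unaffected. Keep in mind that both the Referer and User-Agent headers are sent by the client and are trivial to fake, so these rules only stop lazy scrapers.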
And you can block by IP address if you have identified “bad” bots:
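
A minimal sketch of an IP block, assuming Apache 2.4 and a host that permits authorization directives in .htaccess. The addresses are reserved documentation ranges used here as placeholders; replace them with the IPs you actually see in your logs.

```apache
# --- Block by IP (sketch for Apache 2.4; addresses are placeholders) ---
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except a single address
    Require not ip 192.0.2.10
    # ...and an entire subnet
    Require not ip 198.51.100.0/24
</RequireAll>

# On Apache 2.2 the rough equivalent would be:
#   Order Allow,Deny
#   Allow from all
#   Deny from 192.0.2.10
#   Deny from 198.51.100.0/24
```

A determined scraper can switch addresses, so treat this as a way to stop known offenders rather than a general defence.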