I recently wrote a Java crawler that finds the video links on a website and saves them to a text file. But there is a serious problem.
To prevent crawling, the site uses some method that changes the paths of the videos on the server. I doubt they actually move the video files around dynamically; that would be too costly. My guess is that they encrypt the file paths with some key, such as the session ID.
Now I get an HTTP 410 (Gone) error from the web server. Any ideas on how they prevent crawling, and how to get around their clever method?
There’s a variety of methods that people can implement to protect their resources from theft / scraping:
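One such method, consistent with the session-key guess in the question, is a signed link that expires. The sketch below is only an illustration of that general technique, not a description of what this particular site does. The class name, the secret, the `expires` and `token` query parameters, and the choice of status codes are all assumptions made for the example.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical sketch: video links signed per session with an expiry time.
public class SignedUrlSketch {
    // Assumed server-side secret; it is never sent to the client.
    private static final String SECRET = "server-side-secret";

    // HMAC-SHA256 over path, session ID and expiry, encoded for use in a URL.
    public static String token(String path, String sessionId, long expiresAt) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(SECRET.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            String message = path + "|" + sessionId + "|" + expiresAt;
            byte[] signature = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The link the page would embed for one visitor's session.
    public static String signedUrl(String path, String sessionId, long expiresAt) {
        return path + "?expires=" + expiresAt + "&token=" + token(path, sessionId, expiresAt);
    }

    // Returns the HTTP status this sketch would answer with for a request.
    public static int check(String path, String sessionId, long expiresAt, String token, long now) {
        byte[] expected = token(path, sessionId, expiresAt).getBytes(StandardCharsets.UTF_8);
        byte[] actual = token.getBytes(StandardCharsets.UTF_8);
        if (!MessageDigest.isEqual(expected, actual)) {
            return 403; // token does not match this path, session or expiry
        }
        if (now > expiresAt) {
            return 410; // the link was valid once but has expired
        }
        return 200;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long expiresAt = now + 60_000; // link valid for one minute
        String path = "/videos/example.mp4";
        String t = token(path, "session-A", expiresAt);

        System.out.println(signedUrl(path, "session-A", expiresAt));
        System.out.println("same session, in time: " + check(path, "session-A", expiresAt, t, now));
        System.out.println("same session, too late: " + check(path, "session-A", expiresAt, t, expiresAt + 1));
        System.out.println("other session: " + check(path, "session-B", expiresAt, t, now));
    }
}
```

Under a scheme like this the files never move on the server. Only the link changes, so a URL saved to a text file and requested later, or from a different session, would no longer be accepted.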
If they have copyright claims over the information they publish (or the information isn’t otherwise in the public domain), which is implied if they are trying to prevent this sort of access, then what you are doing is likely to be illegal in most territories around the world.