I’ve again got a strange problem:
I’m writing a crawler to index a specific site. For some weeks it worked fine and I only ran into problems when sending too many requests per hour.
But now I can’t even access a single page.
Even stranger: I have to submit some form values via POST, and the server returns a 404 error, although the URL is definitely correct.
I implemented several techniques to avoid being recognized as a bot: changing the User-Agent, adding delays between requests, and sending a Referer header to pretend the form was submitted from their own website.
Could this again be spam or DDoS protection on their server? Or are there other possible sources of error?
Okay, just solved it.
Some very strange behaviour of the remote server caused the problem: when I sent more parameters than it expected, it returned 404 instead of ignoring the unneeded ones.
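
In case it helps someone with the same symptom, here is a minimal sketch of how I would now guard against it: drop everything the form does not define before building the POST body. The field names in EXPECTED_FIELDS, the example URL and the helper name build_form_request are placeholders I made up for illustration, not the real site's values.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical: the field names the remote form actually defines.
EXPECTED_FIELDS = {"query", "page"}


def build_form_request(url, params, expected=EXPECTED_FIELDS):
    """Build a POST request that carries only the fields the form expects."""
    # Drop any parameter the server does not know about.
    filtered = {k: v for k, v in params.items() if k in expected}
    body = urlencode(filtered).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# The extra "debug" parameter is stripped before the request is built.
req = build_form_request(
    "https://example.com/search",
    {"query": "foo", "page": 2, "debug": 1},
)
print(req.data)  # b'query=foo&page=2'
```

The request is only built here, not sent; passing it to urllib.request.urlopen would submit it. Comparing the filtered body against what a browser sends for the same form is probably the quickest way to find out which parameters the server really expects.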