I’m building a component to ban spam bots’ IPs based on the invalid requests that they make all the time, and that no user could ever make by mistake.
For example, they are always trying to submit empty forms, or making GET requests to URLs that should only receive POST requests.
What I want to know is whether I am at risk of banning Googlebot by doing so.
Is it smart enough not to crawl every URL it encounters? Does it avoid form URLs?
Googlebot follows links. It will only request pages for which it finds a link. Of course, that link doesn’t have to reside on your site and so may not be in your direct control.
Googlebot will only make GET requests because, according to the HTTP RFC, GET is a “safe” method: GET requests should not have side effects, and so should not change state on the server. Hint: never use a link (i.e. a GET) to perform or confirm a change to your site, or any web spider that follows the link might trigger that change.
Every CGI script you have that changes the state of your site should verify that the incoming request is indeed a POST, just to be safe.
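A minimal sketch of that check in Python, assuming a classic CGI setup where the web server passes the request method in the REQUEST_METHOD environment variable. The function name check_post and the wording of the error body are made up for illustration; adapt them to whatever your scripts actually look like.

```python
import os


def check_post(environ=None):
    """Return None for a POST request, otherwise a complete CGI 405 response.

    `environ` defaults to the process environment, which is where a CGI
    server places REQUEST_METHOD. Passing a plain dict makes it testable.
    """
    if environ is None:
        environ = os.environ
    method = environ.get("REQUEST_METHOD", "").upper()
    if method == "POST":
        return None
    # Anything else (GET from a crawler, HEAD, missing method) is refused
    # before the script touches any state.
    return (
        "Status: 405 Method Not Allowed\r\n"
        "Allow: POST\r\n"
        "Content-Type: text/plain\r\n"
        "\r\n"
        "This URL only accepts POST requests.\n"
    )


if __name__ == "__main__":
    rejection = check_post()
    if rejection is not None:
        print(rejection, end="")
    else:
        # Hypothetical placeholder: the state-changing work would go here.
        print("Content-Type: text/plain\r\n\r\nOK")
```

Answering a stray GET with a 405 instead of counting it toward a ban also means that a crawler following an external link to a form URL gets a harmless refusal.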