I don’t know much about SEO and how web spiders work, so forgive my ignorance here. I’m creating a site (using ASP.NET MVC) which has areas that display information retrieved from the database. The data is unique to each user, so there’s no real server-side output caching going on. However, since the data can contain things the user may not wish to have shown in search engine results, I’d like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search results directory isn’t crawled? Also, would a spider even crawl a page that’s dynamically generated, and would blocking certain directories from being crawled hurt my search engine rankings?
Edit: I should add that I’m reading up on the robots.txt protocol, but it relies on cooperation from the web crawler. I’d also like to block any data-mining users who will ignore the robots.txt file.
I appreciate any help!
You can prevent some malicious clients from hitting your server too heavily by implementing throttling on the server. “Sorry, your IP has made too many requests to this server in the past few minutes. Please try again later.” In practice, though, assume that you can’t stop a truly malicious user from bypassing any throttling mechanisms that you put in place.
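To make the throttling idea concrete, here is a minimal sketch of a per-IP sliding-window limiter. It is written in Python purely for illustration, not as ASP.NET MVC code, and the class name, the limits, and the injectable clock are all assumptions of mine. A real deployment would also need shared storage across servers and eviction of idle IPs.

```python
import time
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Allow at most max_requests per client within window_seconds (hypothetical helper)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller should respond with "too many requests"
        hits.append(now)
        return True
```

A request handler would call `allow(ip)` before doing any work and return the "try again later" message when it comes back false. As noted above, a determined client can rotate IP addresses, so this only raises the cost of scraping.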
Given that, here’s the more important question:
Are you comfortable with the information that you’re making available for all the world to see? Are your users comfortable with this?
If the answer to either question is no, then you should be ensuring that only authorized users are able to see the sensitive information. If the information isn’t particularly sensitive but you don’t want clients crawling it, throttling is probably a good alternative. Is it even likely that you’re going to be crawled anyway? If not, robots.txt should be just fine.
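For the robots.txt route, this is a small sketch of what a rule set might look like and how a cooperating crawler would interpret it, using Python's standard `urllib.robotparser`. The `/search/` path is an assumption, so substitute whatever route your search results actually live under.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: ask every crawler to stay out of /search/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def is_crawl_allowed(path, user_agent="*"):
    """Return True if a well-behaved crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

Here `is_crawl_allowed("/search/results?q=foo")` is false while `is_crawl_allowed("/about")` is true. This only describes what polite crawlers will do. Nothing in the file is enforced, which is why anything sensitive belongs behind authorization instead.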