I am storing my sitemaps in my web folder. I want web crawlers (Googlebot etc.) to be able to access the files, but I don't necessarily want all and sundry to have access to them.
For example, this site (stackoverflow.com), has a site index – as specified by its robots.txt file (https://stackoverflow.com/robots.txt).
However, when you type https://stackoverflow.com/sitemap.xml, you are directed to a 404 page.
How can I implement the same thing on my website?
I am running a LAMP website, and I am using a sitemap index file (so I have multiple sitemaps for the site). I would like to use the same mechanism to make all of them unavailable via a browser, as described above.
First, decide which networks should be able to fetch your actual sitemap.
Second, configure your web server to serve the sitemap file to requests from those networks, and to send all other requests to your 404 error page.
For nginx, you’re looking to stick something like allow 10.10.10.0/24; into a location block for the sitemap file.
For apache, you’re looking to use mod_authz_host’s Allow directive in a <Files> directive for the sitemap file.
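As a rough illustration of the Apache side (since the question mentions LAMP), here is a minimal sketch using the Apache 2.2-style access directives. The file name pattern and the 10.10.10.0/24 network are placeholders, not values taken from your setup; substitute your own sitemap file names and the networks you chose in the first step.

```apache
# Hypothetical pattern: matches sitemap.xml, sitemap-index.xml, sitemap1.xml, ...
<FilesMatch "^sitemap.*\.xml$">
    # Evaluate Deny rules first, then Allow rules; an Allow match overrides the Deny
    Order deny,allow
    Deny from all
    # Placeholder network; replace with the ranges you want to admit
    Allow from 10.10.10.0/24
</FilesMatch>
```

With this in place, requests from other addresses get a 403 Forbidden by default rather than a 404, so presenting your 404 page needs additional configuration. On Apache 2.4 the equivalent rule is written with Require ip 10.10.10.0/24 instead of the Order/Deny/Allow lines.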