On a Linux, Apache, PHP site, I need to make certain that a subdirectory, /cms, on my website is not crawlable by search engines.
See, in the root of the site, I have installed a product catalog called Pinnacle Cart. They wanted a News page that pulls content from a CMS. I brought WordPress online in a subdirectory called /cms, created some posts, and then used the following code to bring that into my Pinnacle Cart theme:
<?php require_once('../../../cms/wp-blog-header.php'); ?>
<?php $i = 1; $MAX_ARTICLES_TO_SHOW = 5; ?>
<?php while (have_posts()): the_post(); ?>
<div <?php post_class(); ?> id="post-<?php the_ID(); ?>">
<h2><?php the_title(); ?></h2>
<div class="entry">
<?php the_content(); ?>
</div><!-- .entry -->
<div style="clear:both;"> </div>
<small><?php the_time('F j, Y'); ?></small>
</div><!-- #post-... -->
<?php ++$i; if ($i > $MAX_ARTICLES_TO_SHOW) { break; } ?>
<?php endwhile; ?>
Note that some of the images used in the posts will pull from /cms, and I want those to load okay, but I don’t want Google or any search engine to follow anything under /cms.
Note also that in the WordPress install under /cms, I enabled the setting “Do not let sites like Google, Technorati, etc. index this site.”
I’m thinking I’ll need to either adjust the default theme for the WordPress under /cms/wp-content/themes, or put some sort of .htaccess setting in the /cms or / (root) folder of the site.
You can add a Disallow rule for /cms/ to your robots.txt file.
Read more about it at http://www.robotstxt.org/robotstxt.html
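A minimal sketch of such a rule, assuming robots.txt sits in the site root (served as /robots.txt) and that every crawler should be kept out of /cms:

```
User-agent: *
Disallow: /cms/
```

This only asks compliant crawlers not to fetch anything under /cms. Browsers ignore robots.txt, so images that posts pull from /cms should still load normally for visitors.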
Search engines and scrapers can always ignore this, though most large search engines will follow the rules. You could check $_SERVER['HTTP_USER_AGENT'] too, but this can be faked. There is no 100% reliable way of stopping scrapers.
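As a rough illustration of the user-agent check, here is a sketch in PHP. The function name looks_like_crawler and the token list are hypothetical, chosen only for this example, and the list is far from complete. Where to place the check depends on the site, since putting it in code shared with the News page would also turn crawlers away from that page.

```php
<?php
// Hypothetical helper: the name and the token list are illustrative assumptions,
// not part of WordPress or Pinnacle Cart.
function looks_like_crawler(string $userAgent): bool
{
    // Substrings that commonly appear in the user-agent strings of large crawlers.
    $tokens = ['googlebot', 'bingbot', 'slurp', 'duckduckbot', 'baiduspider', 'yandex'];

    foreach ($tokens as $token) {
        // stripos() does a case-insensitive substring search.
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

// The header may be missing entirely, so fall back to an empty string.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (looks_like_crawler($userAgent)) {
    // Refuse the request. A client that fakes its user agent passes straight through.
    http_response_code(403);
    exit;
}
```

Because the header is supplied by the client, this only filters crawlers that identify themselves honestly, which are generally the same ones that already respect robots.txt.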