I’m looking for the best way to handle the following, and want to make sure I’m doing it correctly:
I have a calendar on my website, in which users can take the calendar iCal feed and import it into external calendars of their preference (Outlook, iCal, Google Calendar, etc…).
To deter bad actors from crawling/searching my website for the *.ics files, I’ve set up robots.txt to disallow the folders in which the feeds are stored.
So, essentially, an iCal feed might look like: webcal://www.mysite.com/feeds/cal/a9d90309dafda390d09/feed.ics
I understand the above is still a public URL. However, I have a function that lets users change the address of their feed if they want.
The problem: every external calendar imports/subscribes to the feed without trouble, except Google Calendar. It shows the message: Google was unable to crawl the URL due to a robots.txt restriction. (Google’s Answer to This.)
After searching around, I’ve found that the following works:
1) Set up a PHP file (which I am using) that essentially forces a download of the file. It basically looks like this:
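<?php
// Feed directory (placeholder path from the original example).
$base = realpath("/home/path/to/local/feed");
// Resolve the requested file; realpath() collapses any "../" segments.
$path = realpath($base . "/" . ($_GET['url'] ?? ''));
// Refuse anything that is missing or that resolves outside the feed directory.
if ($base === false || $path === false || strpos($path, $base . "/") !== 0 || !is_file($path)) {
    http_response_code(404);
    echo "Unable to open feed file.\n";
    exit;
}
// Serve the feed as iCalendar data.
header('Content-Type: text/calendar; charset=utf-8');
readfile($path);
?>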
I tried this script, and it appeared to work with Google Calendar without issues. (I’m not yet sure whether the feed updates/refreshes, though; I’m still waiting to see.)
My question is this: Is there a better way to approach this issue? I’d like to keep the current robots.txt in place to disallow crawling my directories for *.ics files and keep the files hidden.
It looks to me like you have two problems:
1) Keeping bad bots from accessing the website.
2) With robots.txt installed, still allowing Googlebot to access your site.
The first problem cannot be solved by robots.txt. As Marc B points out in a comment, robots.txt is a purely voluntary mechanism: well-behaved crawlers honor it, but nothing forces a bad bot to. To block bad bots once and for all, I suggest using some kind of behavior-analysis program or firewall that detects bad bots and denies access from their IPs.
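As a rough illustration of what "behavior analysis" could mean at its simplest, here is a sketch of a per-IP request-rate check in PHP. The function name exceedsRateLimit, its parameters, and the thresholds are all hypothetical, and how the request timestamps are stored per IP is deliberately left open. A real firewall or bot-detection tool does far more than this.

```php
<?php
// Hypothetical helper: returns true when one IP has made more than $maxHits
// requests within the last $windowSeconds.
// $hits is a list of Unix timestamps of that IP's earlier requests
// (where these are stored is an assumption left to the reader).
function exceedsRateLimit(array $hits, int $now, int $windowSeconds = 60, int $maxHits = 30): bool
{
    // Keep only the requests that fall inside the sliding window.
    $recent = array_filter($hits, function (int $t) use ($now, $windowSeconds): bool {
        return $t > $now - $windowSeconds;
    });

    return count($recent) > $maxHits;
}
```

A feed script could call this before serving the file and answer with an error status (for example 429) when it returns true. The limits would need tuning so that legitimate calendar clients, which poll feeds periodically, are not blocked.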
For the second problem, robots.txt does allow you to whitelist a particular bot. Check http://facebook.com/robots.txt for an example. Note that Google identifies its bots under different names (for AdSense, search, image search, mobile search); I am not sure whether the Google Calendar bot uses the generic Googlebot name or not.
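A whitelist of that kind might look like the fragment below. This is a sketch under two assumptions: that the feeds live under /feeds/cal/ as in the example URL from the question, and that Google Calendar's fetcher obeys the generic Googlebot group, which is not confirmed here.

```
# Let Google's crawler fetch the feed directory.
User-agent: Googlebot
Allow: /feeds/cal/

# Keep every other compliant crawler out of it.
User-agent: *
Disallow: /feeds/cal/
```

A compliant crawler follows the most specific User-agent group that matches it, so Googlebot would use the first group and ignore the second. Bots that ignore robots.txt are unaffected either way.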