I would like to scrape a web site. It has the following in its robots.txt file, but I'm not exactly sure what it is they don't want me to do:
User-agent: *
Disallow: /click
There is no click subdirectory. Or do they not want me to access anything that would normally require clicking (like submitting data via a form)? They sure aren't making it easy in any case: the main page's form issues a GET to a page that sets a cookie, which is then read by a third page.
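It means that no bot should crawl any URLs whose paths start with the string /click. The rule is a plain prefix match on the URL path, so it has nothing to do with directories or with clicking on things. For example, the following URLs should be blocked: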
example.com/click
example.com/click.html
example.com/click/
example.com/click/foo/bar
example.com/clicker

The following URLs would still be allowed:

example.com/foo/click
example.com/fooclick
example.com/clic

You can find the original robots.txt specification at http://www.robotstxt.org/wc/robots.html.
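
If you want to check this yourself, here is a minimal sketch using Python's standard urllib.robotparser module. The example.com URLs, the "MyScraper" user-agent name, and the is_allowed helper are placeholders I made up for illustration; the rules are the two lines from the question. Note that the URLs need a scheme (http://) so that the path is parsed correctly.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the question, fed in directly instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /click
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="MyScraper"):
    """Return True if the rules above permit user_agent to fetch url.

    "MyScraper" is a hypothetical user-agent name; any name matches
    the wildcard (*) entry.
    """
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = [
        "http://example.com/click",
        "http://example.com/click.html",
        "http://example.com/click/foo/bar",
        "http://example.com/clicker",
        "http://example.com/foo/click",
        "http://example.com/fooclick",
        "http://example.com/clic",
    ]
    for url in urls:
        print(url, "allowed" if is_allowed(url) else "blocked")
```

Against a real site you would normally call set_url() and read() on the parser to fetch the live robots.txt rather than pasting the rules in by hand.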