I have installed the apache nutch for web crawling. I want to crawl a

Asked: June 10, 2026

I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt:

User-Agent: *
Disallow: /

Is there any way to crawl this website with Apache Nutch?

1 Answer

Editorial Team · Answer 1 · 2026-06-10T01:16:11+00:00

In nutch-site.xml, set the property protocol.plugin.check.robots to false.
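
As a rough sketch, the override might look like the fragment below. It assumes the installed Nutch version actually reads protocol.plugin.check.robots (not every release does) and that nutch-site.xml uses the usual Hadoop-style configuration layout; merge the property into the existing file rather than replacing it.

```xml
<!-- nutch-site.xml: local overrides for nutch-default.xml -->
<configuration>
  <!-- Assumption: this Nutch version honors the property named in the answer -->
  <property>
    <name>protocol.plugin.check.robots</name>
    <value>false</value>
  </property>
</configuration>
```

If the crawl is still denied by robots.txt after this change, the property is probably not consulted by the fetcher in that version, which is where the second option applies.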

OR

Comment out the code where the robots check is done, then rebuild Nutch.
In Fetcher.java, lines 605-614 perform the check (the exact line numbers vary by Nutch version). Comment out this entire block:

      if (!rules.isAllowed(fit.u)) {
        // unblock
        fetchQueues.finishFetchItem(fit, true);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Denied by robots.txt: " + fit.url);
        }
        output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
        reporter.incrCounter("FetcherStatus", "robots_denied", 1);
        continue;
      }
