I have a Nutch crawl task which has been running a whole day long

Question


Asked: June 1, 2026


I have a Nutch crawl task that had been running for a whole day until I killed the process by mistake.

I don’t want to re-crawl the seeds (it costs too much time), so I wonder whether there is a way, or some Nutch crawler parameter, to make the crawler ignore URLs that have already been crawled.

Many thanks!


1 Answer

Editorial Team · Answer 1 · 2026-06-01T16:47:52+00:00


Once the crawl had started, some segments were probably created in the output directory. Run the bin/nutch command again and point the -dir option at the output directory of the previous run, so the existing crawl data is reused instead of starting over. For the urlDir argument, create a dummy directory containing a single URL (just to avoid the error raised when urlDir has no URLs in it).
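As a rough sketch of the above, assuming an older Nutch 1.x release where the all-in-one `crawl` command is still available and that you run it from the Nutch install directory. The directory names (`crawl`, `dummy_urls`), the placeholder URL, and the `-depth`/`-topN` values are all assumptions for illustration, not taken from the original setup:

```shell
# Assumed paths (hypothetical); adjust to match your interrupted run.
CRAWL_DIR="crawl"          # output directory of the previous run
DUMMY_URLS="dummy_urls"    # placeholder seed directory

# A seed directory with a single URL, only to satisfy the urlDir argument.
mkdir -p "$DUMMY_URLS"
echo "http://example.com/" > "$DUMMY_URLS/seed.txt"

# Point -dir at the existing crawl directory so its data is reused.
# The depth and topN values here are illustrative only.
if [ -x bin/nutch ]; then
  bin/nutch crawl "$DUMMY_URLS" -dir "$CRAWL_DIR" -depth 3 -topN 1000
else
  echo "bin/nutch not found; run this from the Nutch install directory"
fi
```

Note that the dummy URL is itself injected as a seed, so it is worth picking one you do not mind being fetched.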
