I have a Nutch crawl task that had been running for a whole day when I killed the process by mistake.
I don't want to re-crawl the seeds (it costs too much time), so I wonder whether there is a way, or some Nutch crawler parameter, to make the crawler ignore the URLs that have already been crawled.
Many thanks!
Once the crawl had started, it probably created some segments in the output directory. Run the bin/nutch command again and point
the `-dir` option at the output directory of the previous run. For the `urlDir` argument, create a dummy directory containing a single URL (this is only to avoid the error you get if the URL directory has no URLs in it).
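
As a rough sketch of the steps above, assuming a Nutch 1.x install that still ships the one-step `crawl` command and that the previous run wrote to a directory named `crawl`: the directory names, the seed URL, and the `-depth` and `-topN` values are all placeholders, so substitute your own.

```shell
# Assumption: output directory of the interrupted run (placeholder name)
CRAWL_DIR="crawl"
# Assumption: name of the throwaway seed directory (placeholder name)
DUMMY_URL_DIR="dummy_urls"

# Create the dummy urlDir with one URL so the crawl command
# does not fail on an empty seed list
mkdir -p "$DUMMY_URL_DIR"
echo "http://example.com/" > "$DUMMY_URL_DIR/seed.txt"

# Re-run the crawl against the existing output directory.
# -depth and -topN are illustrative values; reuse the ones from your first run.
if [ -x bin/nutch ]; then
  bin/nutch crawl "$DUMMY_URL_DIR" -dir "$CRAWL_DIR" -depth 3 -topN 1000
else
  echo "bin/nutch not found; run this from the Nutch install directory"
fi
```

Note that the segment being fetched when the process was killed may be incomplete, so some of its URLs may still be fetched again.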