I’m trying to run Apache Nutch from Eclipse. I followed the instructions at http://wiki.apache.org/nutch/RunNutchInEclipse. However, the sources of “parse-html” (both java and test) have errors. I ran it anyway; it reads and fetches the URLs from seed.txt and then fails with this error:
Fetcher: finished at 2012-03-31 17:21:56, elapsed: 00:00:07
ParseSegment: starting at 2012-03-31 17:21:56
ParseSegment: segment: crawl/segments/20120331172142
Exception in thread "main" java.io.IOException: Job failed!
I would like to point out that my goal is to get indexes from Nutch and store them in MongoDB.
I found three jars, added them to the project as external jars, and it worked. The jars are cyberneko.jar, rome-0.9.jar and tagsoup-1.2.jar; you can find all of them with a simple Google search.
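If you want to confirm that the jars are actually visible to the Eclipse run configuration before starting a crawl, a small check like the one below might help. It is only a sketch: the class name ClasspathCheck is made up for illustration, and the three library class names are my assumption about what those jars contain (they are not taken from the Nutch setup instructions), so adjust them if your jar versions differ.

```java
// Hypothetical helper, not part of Nutch: reports whether classes that the
// three jars are assumed to provide can be found on the current classpath.
class ClasspathCheck {

    // Returns true if the named class can be located without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per jar (verify against your jars).
        String[] expected = {
            "org.cyberneko.html.parsers.DOMFragmentParser", // cyberneko.jar (assumed)
            "com.sun.syndication.io.SyndFeedInput",         // rome-0.9.jar (assumed)
            "org.ccil.cowan.tagsoup.Parser"                 // tagsoup-1.2.jar (assumed)
        };
        for (String name : expected) {
            System.out.println((isOnClasspath(name) ? "found:   " : "MISSING: ") + name);
        }
    }
}
```

Run it with the same classpath as the Nutch launch configuration. Any line marked MISSING suggests that the corresponding jar was not picked up, which would be consistent with the compile errors in parse-html.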