I want to build a dataset of about 2000-3000 web pages, starting from several seed URLs. I tried the Nutch crawler but could not get it done (I was unable to convert the fetched 'segments' data into HTML pages).
Can you suggest a different crawler you have used, or any other tool? Also, what if the web pages contain absolute URLs, which would make offline use of the dataset impossible?
You can NOT directly convert the Nutch crawled segments to HTML files.
I suggest these options:
1. Look at the org.apache.nutch.segment.SegmentReader class. You can then dig into it and modify how it works to suit your use case.
2. Get the URLs using the "bin/nutch readdb" command (use the dump option). Then write a script to wget the URLs and save each page in HTML form. Done!
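
For the second option, here is a minimal sketch of such a script in Python, using urllib in place of wget. It assumes that each record in the readdb dump starts with a line beginning with the URL, followed by whitespace and other fields; check your own dump file, since the layout may differ between Nutch versions. The function names and the hashed file naming scheme are my own choices for illustration, not anything Nutch provides.

```python
import hashlib
import sys
import urllib.request
from pathlib import Path


def extract_urls(lines):
    """Return unique URLs, in first-seen order, from lines starting with http(s)://.

    Assumption: a dump record begins with the URL as the first token on its line.
    """
    seen = set()
    urls = []
    for line in lines:
        line = line.strip()
        if line.startswith(("http://", "https://")):
            url = line.split()[0]  # drop anything after the URL on the same line
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def url_to_filename(url):
    """Map a URL to a stable, filesystem-safe file name (hypothetical scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def download_all(urls, out_dir):
    """Fetch each URL and save the raw response body into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                (out / url_to_filename(url)).write_bytes(resp.read())
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            print(f"skipped {url}: {exc}", file=sys.stderr)


# usage: python fetch_pages.py <dump_file> <output_dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        download_all(extract_urls(fh), sys.argv[2])
```

The pages are saved exactly as served, so absolute links inside them still point to the live site. If you use wget itself rather than this script, its --convert-links option rewrites links in the downloaded pages for local viewing, which may help with the offline-use concern in the question.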