I want to build a dataset consisting about 2000-3000 web pages, starting with several

Asked: June 1, 2026 (2026-06-01T08:00:55+00:00)

I want to build a dataset of about 2000-3000 web pages, starting from several seed URLs. I tried the Nutch crawler, but I could not get it done: I was unable to convert the fetched ‘segments’ data into HTML pages.

Can you suggest a different crawler, or any other tool, that you have used? Also, what if the web pages contain absolute URLs, which would make offline use of the dataset impossible?

1 Answer

Editorial Team · Answer 1 · 2026-06-01T08:00:56+00:00

You cannot convert the segments crawled by Nutch directly into HTML files.

I suggest one of these options:

Modify the source code to do it. Study the org.apache.nutch.segment.SegmentReader class, then change how it works to fit your use case.
Easy solution, if you don't want to invest time in studying the code: use Nutch to crawl all the required pages, then get the URLs that were actually crawled with the “bin/nutch readdb” command (use its dump option). Finally, write a script that fetches each URL with wget and saves it as an HTML file. Done!
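
The script step of the easy solution could look roughly like the sketch below. It uses Python's standard urllib in place of wget, which is a substitution and not what the answer prescribes. It assumes the readdb dump is a text file in which each record begins with the URL followed by a tab; check that against your own dump before relying on it. The names extract_urls, url_to_filename, and fetch_all are hypothetical helpers made up for this illustration.

```python
import hashlib
import sys
import urllib.request
from pathlib import Path


def extract_urls(dump_text):
    """Return the unique URLs found in a crawldb text dump, in order."""
    seen = set()
    urls = []
    for line in dump_text.splitlines():
        # Assumption: a record starts with the URL, followed by a tab.
        if line.startswith(("http://", "https://")):
            url = line.split("\t", 1)[0].strip()
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def url_to_filename(url):
    """Map a URL to a stable, filesystem-safe .html file name."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def fetch_all(urls, out_dir, timeout=30):
    """Download each URL and save the raw response body under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                (out / url_to_filename(url)).write_bytes(resp.read())
        except OSError as err:  # URLError and timeouts are OSError subclasses
            print(f"skipped {url}: {err}", file=sys.stderr)


# Expected call: python fetch_pages.py <dump_file> <output_dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    dump_file, out_dir = sys.argv[1], sys.argv[2]
    text = Path(dump_file).read_text(encoding="utf-8", errors="replace")
    fetch_all(extract_urls(text), out_dir)
```

File names are hashes of the URL so that long or unusual URLs still map to valid names; keep a separate URL-to-file index if you need to look pages up later. The script saves pages exactly as served, so absolute links inside them still point at the live site. If you use wget as suggested, its --convert-links option rewrites links in downloaded documents for local viewing, which may help with the offline concern raised in the question.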
