The Archive Base

Editorial Team
Asked: June 1, 2026

I want to build a dataset of about 2000-3000 web pages, starting from several seed URLs. I tried the Nutch crawler, but I could not convert the fetched ‘segments’ data into HTML pages.

Can you suggest a different crawler, or any other tool, that you have used? Also, what if the web pages contain absolute URLs, which would make offline use of the dataset impossible?

1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 8:00 am

    You cannot convert the segments crawled by Nutch directly into HTML files.

    I suggest one of these options:

    1. Modify the source code to do it. Study the org.apache.nutch.segment.SegmentReader class, then adapt how it works to your use case.
    2. Easy solution, if you do not want to invest time in studying the code: use Nutch to crawl all the required pages, then get the URLs that were actually crawled with the “bin/nutch readdb” command (use the dump option). Finally, write a script that runs wget on those URLs and saves each page as HTML.
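A minimal sketch of the script from option 2, written in Python instead of a wget loop. It assumes each record in the readdb dump begins with a line that starts with the URL followed by whitespace; check this against your own dump before relying on it. The function names (extract_urls, url_to_filename, save_pages) and the hashed file names are illustrative choices, not part of Nutch.

```python
import hashlib
import re
import urllib.request
from pathlib import Path

# Assumption: a record's first line starts with the URL, e.g.
# "http://example.com/<TAB>Version: 7". Other lines are metadata.
URL_LINE = re.compile(r"^(https?://\S+)")


def extract_urls(dump_text):
    """Return the unique URLs found in crawldb dump text, in first-seen order."""
    found = []
    for line in dump_text.splitlines():
        match = URL_LINE.match(line)
        if match:
            found.append(match.group(1))
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(found))


def url_to_filename(url):
    """Map a URL to a stable, filesystem-safe file name (hypothetical scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return digest + ".html"


def save_pages(urls, out_dir):
    """Download each URL and store the raw response body under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                (out / url_to_filename(url)).write_bytes(response.read())
        except OSError as exc:
            # URLError and HTTPError are subclasses of OSError.
            print(f"skipped {url}: {exc}")
```

To use it, read the text of a dump file into a string, pass it to extract_urls, and hand the result to save_pages together with an output directory. If you prefer wget itself, write the extracted URLs to a text file and pass it with wget's -i option; its --convert-links option rewrites links in the saved pages for local viewing, which may help with the absolute-URL concern in the question.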

© 2021 The Archive Base. All Rights Reserved