I recently started looking at Apache Nutch. I was able to set it up and crawl

Asked: June 4, 2026 (2026-06-04T16:23:42+00:00)

I recently started looking at Apache Nutch. I was able to set it up and crawl the web pages I am interested in, but I do not quite understand how to read the crawled data. I basically want to associate each page's data with some metadata (random data for now) and store it locally so that it can later be used for (semantic) search. Do I need to use Solr or Lucene for this? I am new to all of these. As far as I know, Nutch is used to crawl web pages. Can it also do additional things, such as adding metadata to the crawled data?

1 Answer

Editorial Team · Answer 1 · 2026-06-04T16:23:43+00:00

Useful commands.

Begin crawl

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Get statistics for the crawled URLs

bin/nutch readdb crawl/crawldb -stats

Read a segment (gets all the data from the web pages)

bin/nutch readseg -dump crawl/segments/* segmentAllContent

Read a segment (gets only the text field)

bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata
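A crawl with -depth 3 normally leaves more than one directory under crawl/segments, and as far as I know readseg -dump expects a single segment directory followed by an output directory, so the crawl/segments/* glob only behaves as intended when exactly one segment exists. A minimal sketch of dumping each segment separately is below. The function name dump_text_per_segment and the NUTCH variable are hypothetical names introduced here for illustration; the readseg flags are the ones from the command above.

```shell
#!/bin/sh
# Assumption: NUTCH is the path to the nutch launcher script.
# It defaults to bin/nutch, as used in the commands above.
NUTCH="${NUTCH:-bin/nutch}"

# dump_text_per_segment SEGMENTS_DIR OUT_DIR   (hypothetical helper)
# Runs one text-only "readseg -dump" per segment directory and writes
# each dump into OUT_DIR/<segment-name>.
dump_text_per_segment() {
    segments_dir=$1
    out_dir=$2
    for seg in "$segments_dir"/*/; do
        # Skip the unexpanded pattern when no segment directories exist.
        [ -d "$seg" ] || continue
        seg=${seg%/}
        name=$(basename "$seg")
        "$NUTCH" readseg -dump "$seg" "$out_dir/$name" \
            -nocontent -nofetch -nogenerate -noparse -noparsedata
    done
}
```

Calling dump_text_per_segment crawl/segments segmentTextContent from the Nutch directory should then produce one dump per segment under segmentTextContent/. Dropping the -no... flags from the helper gives the full dump instead of the text-only one.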

Get the list of known links to each URL, including both the source URL and the anchor text of each link.

bin/nutch readlinkdb crawl/linkdb/ -dump linkContent

Get all crawled URLs. This also gives other information, such as whether each URL was fetched, the fetch time, the modified time, etc.

bin/nutch readdb crawl/crawldb/ -dump crawlContent

For the second part, i.e. adding a new field, I am planning to use the index-extra plugin or to write a custom plugin.

Refer:

this and this
