After a Nutch crawl in distributed (deploy) mode as follows: bin/nutch crawl s3n://….. -depth

Question

Asked: June 2, 2026 (2026-06-02T05:26:28+00:00)

After a Nutch crawl in distributed (deploy) mode as follows:

bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20

I need to extract each fetched URL along with its content in a MapReduce-friendly format. The readseg command below does dump the content, but its output format doesn’t lend itself to MapReduce processing.

bin/nutch readseg -dump /crawl/segments/*  /output  -nogenerate -noparse -noparsedata -noparsetext

Ideally the output should be in this format:

http://abc.com/1     content of http://abc.com/1
http://abc.com/2     content of http://abc.com/2

Any suggestions on how to achieve this?

1 Answer

Editorial Team · Answer 1 · 2026-06-02T05:26:31+00:00

The answer lies in tweaking the Nutch source code, which turned out to be quite simple. Open the SegmentReader.java file in apache-nutch-1.4-bin/src/java/org/apache/nutch/segment

Inside the SegmentReader class is a reduce method, which is responsible for generating the human-readable output of the bin/nutch readseg command. Alter the StringBuffer dump variable as you see fit; it holds the entire output for a given URL, which is represented by the key variable.
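As a rough illustration of the kind of change meant here, the sketch below builds one tab-separated line per URL, matching the format asked for in the question. It is a standalone example, not the actual Nutch code: the class UrlContentLine and the method formatRecord are hypothetical names, and the url and content strings stand in for the key and the fetched content that would be available inside reduce. Flattening tabs and line breaks to spaces is an assumption made so that each record stays on a single line.

```java
public class UrlContentLine {

    // Hypothetical helper, not part of Nutch: renders one record as
    // "<url>\t<content>\n". In SegmentReader.reduce the url would come
    // from the key variable and the content from the fetched record.
    public static String formatRecord(String url, String content) {
        // Collapse tabs and line breaks into single spaces so the record
        // stays on one line and the first tab always separates the URL
        // from the content.
        String flattened = content.replaceAll("[\\t\\r\\n]+", " ").trim();

        // Same buffer type and name as the dump variable mentioned above.
        StringBuffer dump = new StringBuffer();
        dump.append(url).append('\t').append(flattened).append('\n');
        return dump.toString();
    }

    public static void main(String[] args) {
        // Made-up content for the example URL from the question.
        System.out.print(formatRecord(
                "http://abc.com/1",
                "<html>\n\t<body>Hello</body>\n</html>"));
    }
}
```

A downstream mapper can then split each line on the first tab to recover the URL and its content. If the original line breaks in the content matter, a reversible escaping scheme would be a better choice than flattening.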

Make sure to run ant to build a new binary; subsequent calls to bin/nutch readseg will then generate the output in your custom format.

These references were extremely useful in navigating the code:
[1] http://nutch.apache.org/apidocs-1.4/overview-summary.html
[2] http://nutch.apache.org/apidocs-1.3/index-all.html
