I am using nutch 1.4 to crawl websites. For demo purpose, I started crawling


Asked: May 31, 2026 (2026-05-31T22:37:22+00:00)


I am using Nutch 1.4 to crawl websites. For demo purposes, I started crawling jabong.com, but I observed that Nutch could not fetch all the links on the site.

For example, after fetching http://www.jabong.com/women/clothing/womens-suits-sets/
it does not pick up the links on that page that are attached to images.

I have configured Nutch as follows:
conf/nutch-default.xml --> added the agent name
conf/regex-urlfilter.txt --> instead of +. , I wrote +^http://([a-z0-9]*.)*jabong.com/
seed.txt contains http://www.jabong.com/

Can someone tell me why it is not fetching all the links?


1 Answer

Editorial Team · Answer 1 · 2026-05-31T22:37:24+00:00

I finally solved this problem after struggling with it for a long time, so I am sharing the solution here 🙂
You have to adjust the parameters defined in nutch-default.xml in the conf directory.

Check the maximum content length setting (the http.content.limit property). Its default value is 65536 bytes (about 64 KB), but the page content was much larger than that, so Nutch truncated the page instead of fetching all of it. That is why the links past the cut-off did not show up in the crawled page.
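
A minimal sketch of raising that limit, assuming the property in question is http.content.limit (the name used in the nutch-default.xml shipped with Nutch 1.x) and that the override is placed in conf/nutch-site.xml rather than edited into the defaults file. The value -1 is only one possible choice; any byte count larger than the biggest page you expect would also do.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override those in conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Default is 65536 bytes. A negative value disables truncation;
         a positive value such as 1048576 would cap each page at 1 MB instead. -->
    <value>-1</value>
  </property>
</configuration>
```

Pages fetched before the change were stored truncated, so they need to be fetched again before the missing links can appear.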

So before crawling any site, do check these parameters 🙂
Enjoy crawling 🙂

PS: I am sorry in case someone feels that I posted a question here and then posted the solution myself. Before posting the question I actually tried a lot.
