
Question

Editorial Team

Asked: May 14, 2026

I can successfully run the crawl command via Cygwin on Windows XP, and I can also run web searches through Tomcat.

But I also want to save the parsed pages while the crawl is running.

So when I start a crawl like this:

bin/nutch crawl urls -dir crawled -depth 3

I also want the parsed HTML pages saved as text files.

I mean that, during the crawl started with the command above, whenever Nutch fetches a page it should also automatically save the parsed version of that page (text only) to a text file.

The file names could be the fetched URLs.

I really need help with this.

This will be used in my university language detection project.

Thanks.


Editorial Team · Answer 1 · 2026-05-14T07:15:43+00:00


The crawled pages are stored in the segments. You can access them by dumping the segment content:

nutch readseg -dump crawl/segments/20100104113507/ dump

You will have to do this for each segment.
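Since the dump has to be repeated per segment, a small shell loop can do it after the crawl finishes. This is only a sketch under a few assumptions that are not in the original post: the crawl directory is `crawled` (the `-dir` value from the question), the dumps go under a hypothetical `dumps/` directory, and the `dump_segments` function name is made up for illustration. The `-nocontent -nofetch -nogenerate -noparse -noparsedata` options are the readseg general options in Nutch 1.x that skip everything except the parsed text; check `bin/nutch readseg` usage on your version before relying on them.

```shell
#!/bin/sh
# Sketch: dump only the parsed text of every segment in a crawl directory.
# Assumed names (not from the original post): "crawled", "dumps", dump_segments.
NUTCH="${NUTCH:-bin/nutch}"

dump_segments() {
    crawl_dir="$1"   # directory passed to "crawl -dir"
    out_dir="$2"     # where the per-segment dumps are written
    for segment in "$crawl_dir"/segments/*/; do
        # Skip the literal pattern when no segment directory exists.
        [ -d "$segment" ] || continue
        name=$(basename "$segment")
        # One output directory per segment, named after the segment timestamp.
        "$NUTCH" readseg -dump "$segment" "$out_dir/$name" \
            -nocontent -nofetch -nogenerate -noparse -noparsedata
    done
}

dump_segments "${1:-crawled}" "${2:-dumps}"
```

Run it from the Nutch directory (under Cygwin in this setup). In Nutch 1.x each output directory typically ends up with a single file named `dump` that holds one record per URL, so producing one text file per URL would still need a separate splitting step.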
