I used Nutch 1.4 to crawl a website.
The crawl completed successfully and all the pages were stored in segments.
I merged all the segments into one segment and then used the readseg command to obtain a text version of all the crawled pages.
Now I need to find the URL of each page and the metadata stored in that page.
I don't know which command to use, or whether I need to do something different.
I have searched Google extensively. Some people said that you have to write a separate plugin for it. Can someone please tell me how to do this?
Thanks a lot 🙂 🙂
I finally managed to do it. I'm sharing the solution in case someone else needs it.
You can use the index-metatags plugin described here:
http://wiki.apache.org/nutch/IndexMetatags
It extracts the meta tags from each crawled page so they can be indexed along with the page, which solves this problem.
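If you only need the URLs plus whatever metadata readseg already prints, a small script over the text dump may be enough. Here is a minimal Python sketch, not part of Nutch. It assumes the dump separates records with `Recno:: N` lines, puts the page address on a `URL:: ...` line, and prints metadata on lines beginning with `Content Metadata:` or `Parse Metadata:`. Check your own dump file first, since the layout may differ between versions. The function name `iter_records` is my own, and page text that happens to start with one of these prefixes would be picked up as well.

```python
import re
from typing import Dict, Iterator, List, Union

# Assumed layout of a readseg text dump (verify against your own file):
#   Recno:: 0
#   URL:: http://example.com/
#   ...
#   Content Metadata: ...
#   Parse Metadata: ...
RECORD_SEPARATOR = re.compile(r"^Recno::[ \t]*\d+[ \t]*$", flags=re.MULTILINE)
URL_PREFIX = "URL:: "
METADATA_PREFIXES = ("Content Metadata:", "Parse Metadata:")


def iter_records(dump_text: str) -> Iterator[Dict[str, Union[str, List[str]]]]:
    """Yield one dict per dumped record with its URL and metadata lines."""
    # Everything before the first "Recno::" line is not a record, so skip it.
    for chunk in RECORD_SEPARATOR.split(dump_text)[1:]:
        url = None
        metadata: List[str] = []
        for raw_line in chunk.splitlines():
            line = raw_line.strip()
            if url is None and line.startswith(URL_PREFIX):
                url = line[len(URL_PREFIX):]
            elif line.startswith(METADATA_PREFIXES):
                metadata.append(line)
        if url is not None:
            yield {"url": url, "metadata": metadata}
```

To use it, read the dump file that readseg wrote into a string, pass it to `iter_records`, and print each record's `url` and `metadata`.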
Cheers 🙂