I recently started looking at Apache Nutch. I was able to set it up and crawl the web pages I am interested in. What I don't understand is how to read the crawled data. I want to associate each page's data with some metadata (random data for now) and store it locally so that it can later be used for semantic search. Do I need Solr or Lucene for this? I am new to all of these. As far as I know, Nutch is used to crawl web pages. Can it also do additional things, such as adding metadata to the crawled data?
Useful commands.
Begin crawl
Get statistics of crawled URLs
Read segment (gets all the data from web pages)
Read segment (gets only the text field)
Get the list of all known links to each URL, including both the source URL and the anchor text of each link.
Get all crawled URLs, along with other information such as whether each was fetched, its fetch time, and its modified time.
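For reference, the commands for the steps listed above might look roughly like the sketch below. It assumes an older Nutch 1.x release that still ships the one-step `crawl` command, a seed directory named `urls`, and a crawl directory named `crawl`. The output directory names (`segment_all`, `segment_text`, `link_dump`, `crawl_dump`), the `DRY_RUN` switch, and the `run` helper are made up for illustration, and the depth and topN values are only examples.

```shell
#!/bin/sh
# Sketch only: paths, output names, and the DRY_RUN switch are assumptions.
NUTCH="${NUTCH:-bin/nutch}"   # Nutch launcher, relative to the install directory
CRAWL="${CRAWL:-crawl}"       # crawl directory (holds crawldb, linkdb, segments)
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Begin crawl: "urls" is the seed directory; depth 3 and topN 5 are example values.
run "$NUTCH" crawl urls -dir "$CRAWL" -depth 3 -topN 5

# Get statistics of crawled URLs.
run "$NUTCH" readdb "$CRAWL/crawldb" -stats

# readseg reads one segment directory; pick the newest (names are timestamps).
SEGMENT="$(ls -d "$CRAWL"/segments/* 2>/dev/null | tail -n 1)"
SEGMENT="${SEGMENT:-$CRAWL/segments/SEGMENT_NAME}"  # placeholder if none exists yet

# Read segment: dump all the data from the web pages.
run "$NUTCH" readseg -dump "$SEGMENT" segment_all

# Read segment: dump only the parsed text.
run "$NUTCH" readseg -dump "$SEGMENT" segment_text \
  -nocontent -nofetch -nogenerate -noparse -noparsedata

# Dump the known inlinks (source URL and anchor text) for each URL.
run "$NUTCH" readlinkdb "$CRAWL/linkdb" -dump link_dump

# Dump every crawled URL with its fetch status and timestamps.
run "$NUTCH" readdb "$CRAWL/crawldb" -dump crawl_dump
```

As written, the script only prints the commands; run it from the Nutch install directory with `DRY_RUN=0` to execute them. Newer Nutch 1.x releases, as far as I know, replace the one-step `crawl` command with a separate `bin/crawl` script whose arguments differ, so check the usage output of your version first.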
For the second part, i.e. adding a new field, I am planning to use the index-extra plugin or to write a custom plugin.
Refer:
this and this