I can crawl and index the web pages using Nutch, but I don’t know how to read the index and extract data from it.
Could anyone recommend some useful tools for reading the index?
I want to add a Chinese language analyzer and an IndexFilter plugin, so I need to read the index to validate my plugin. I also want to process the crawled data in Java.
Use the Luke tool to browse Nutch indexes. Its dump index option can export the entire index to an XML file. If you have to do this in code, you will need to learn the Lucene API.
To read the crawled content, use the Nutch segment reader (bin/nutch readseg).
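If you want to process the segment data in Java without going through the Nutch APIs, one simple route is to dump a segment to text first and parse that file. The sketch below assumes a dump produced by something like bin/nutch readseg -dump <segment_dir> <output_dir>, and it assumes each record in the resulting text file contains a line starting with "URL:: ". Check both against your Nutch version. The class name SegmentDumpUrls, the method extractUrls, and the default path segdump/dump are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SegmentDumpUrls {

    // Assumption: every record in the text dump has a line "URL:: <address>".
    private static final String URL_PREFIX = "URL:: ";

    /** Collects the value of every line that starts with "URL:: ", in file order. */
    public static List<String> extractUrls(Reader source) throws IOException {
        List<String> urls = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(URL_PREFIX)) {
                    urls.add(line.substring(URL_PREFIX.length()).trim());
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real dump file as the first argument.
        String path = args.length > 0 ? args[0] : "segdump/dump";
        try (Reader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            for (String url : extractUrls(reader)) {
                System.out.println(url);
            }
        }
    }
}
```

The same line-by-line approach can be extended to pick up other sections of each record, but a page whose own text happens to contain a line starting with "URL:: " would be counted too, so treat the result as a quick check, not an exact listing.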