I want to remove specific elements from the page response before it is handed to Nutch.
Specifically, I want to mark parts of my pages like this:
<div class="noindex">I shall not be indexed</div>
I want to remove them before Nutch parses the page, so that "I shall not be indexed" is not present in the NutchDocument afterwards. I plan to wrap my navigation, header, and footer content in such elements, because right now they appear in every document in the index.
Thanks,
Paul
You have a couple of alternatives:
Write a Nutch plugin that strips the marked elements. This blog post has an excellent example of writing a Nutch plugin: http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
Use a content extraction algorithm. This article compares several of them: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/ The best place to run one is probably also inside a Nutch plugin.
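For the plugin route, the core step is just a DOM walk that drops every element carrying the noindex class. Below is a minimal sketch of that step using only the JDK's org.w3c.dom API. It is not Nutch code: the class name NoIndexStripper and its methods are made up for illustration, the input is assumed to be well-formed XHTML so that the JDK parser can read it, and how you wire this into a plugin (which extension point, and whether the parsed text must be regenerated afterwards) is left open.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class NoIndexStripper {

    // Recursively removes every element whose class attribute
    // contains the token "noindex", together with its subtree.
    static void stripNoIndex(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before a possible removal
            if (child instanceof Element && hasClass((Element) child, "noindex")) {
                node.removeChild(child);
            } else {
                stripNoIndex(child);
            }
            child = next;
        }
    }

    // True if the element's class attribute lists the given class name.
    static boolean hasClass(Element el, String name) {
        for (String token : el.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(name)) {
                return true;
            }
        }
        return false;
    }

    // Parses well-formed XHTML (an assumption of this sketch), strips the
    // marked elements, and returns the remaining text content.
    static String visibleText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        stripNoIndex(doc.getDocumentElement());
        return doc.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class=\"noindex\">I shall not be indexed</div>"
                + "<p>Real content</p>"
                + "</body></html>";
        System.out.println(visibleText(page));
    }
}
```

Matching on class tokens rather than the whole attribute value means an element such as <div class="nav noindex"> is removed as well. Real pages are rarely well-formed XML, so inside a plugin you would apply the same walk to the DOM that Nutch's HTML parser has already built rather than re-parsing the page yourself.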