I’m getting search results back from a crawled internal site. The problem is that the same page shows up several times, because the site’s links use location hashes (URL fragments):
http://site.com/en/personal/refunds.html
http://site.com/en/personal/refunds.html#
http://site.com/en/personal/refunds.html#content
http://site.com/en/personal/refunds.html#section1
Although they may all be relevant, it doesn’t look good when they’re my top four results!
Is there any way to have them treated as one result?
It looks like # and #content occur on most pages, so I could apply a rule to filter these out. One is used to skip to the content and the other to toggle the accessibility stylesheet.
OK, I got this working. I edited the regex-normalize.xml file and told it to ignore URLs with # in them.
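For anyone trying the same thing, here is a minimal sketch of what such a rule in regex-normalize.xml might look like. It is not necessarily the exact rule used here. It assumes the usual layout of that file (a regex-normalize root containing regex entries, each with a pattern and a substitution), and the pattern is only one possible way to strip the fragment:

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumes the standard regex-normalize.xml layout. -->
<regex-normalize>
  <!-- Remove the location hash: drop everything from '#' up to the
       next '?' or '&', or to the end of the URL, and keep whatever
       delimiter ended the match. -->
  <regex>
    <pattern>#.*?(\?|&amp;|$)</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

With a rule like this, refunds.html#, refunds.html#content and refunds.html#section1 should all normalize to the plain refunds.html URL, so the crawler stores the page once.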
I needed to add “urlfilter-regex” to the plugin.includes property in nutch-site.xml to get it to use this file.
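As a rough illustration, the property in nutch-site.xml might end up looking something like the sketch below. The other plugin names in the value are placeholders for whatever your installation already lists, not a recommended set. As far as I know, regex-normalize.xml is normally read by the urlnormalizer-regex plugin, so the sketch includes that as well, which is an assumption on my part:

```xml
<!-- Sketch only: keep the plugins you already use; the list below is illustrative. -->
<property>
  <name>plugin.includes</name>
  <!-- urlfilter-regex as described above; urlnormalizer-regex is assumed
       to be the plugin that applies the rules in regex-normalize.xml. -->
  <value>protocol-http|urlfilter-regex|urlnormalizer-(pass|regex|basic)|parse-html|index-basic</value>
</property>
```

URLs that were crawled before the change will most likely need to be re-crawled or re-indexed before the duplicates disappear from the results.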