Is there a Java library that measures the similarity between web pages (HTML/DOM structure similarity)?
In my application, I want to classify the links of a website by page type.
For example:
Group 1: product detail pages (for online shopping sites, etc.)
Group 2: category pages
For this kind of classification, I think HTML structure (DOM) similarity is the best approach. Any help would be appreciated.
Not exactly what you asked for, but if the HTML is well-formed XML you can use XMLUnit; comparing two documents for similarity with it is very simple.
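If you want a numeric score for grouping pages, the idea can also be sketched with the JDK alone. This is only a rough sketch under stated assumptions, not XMLUnit's API: it assumes both pages are already well-formed XML (real-world HTML would first have to be cleaned up by an HTML parser), and the class name `DomSimilarity`, the methods `tagPaths` and `similarity`, and the choice of Jaccard similarity over root-to-element tag paths are all my own illustration.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: scores two well-formed XHTML pages by shared DOM structure.
public class DomSimilarity {

    // Collects the set of root-to-element tag paths, e.g. "/html/body/div".
    // Text content and attributes are ignored; only the tag structure counts.
    static Set<String> tagPaths(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            Set<String> paths = new HashSet<>();
            collect(doc.getDocumentElement(), "", paths);
            return paths;
        } catch (Exception e) {
            // Assumption: input that is not well-formed XML is rejected outright.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    private static void collect(Element el, String prefix, Set<String> paths) {
        String path = prefix + "/" + el.getTagName();
        paths.add(path);
        for (Node child = el.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                collect((Element) child, path, paths);
            }
        }
    }

    // Jaccard similarity of the two tag-path sets: |A ∩ B| / |A ∪ B|, in [0, 1].
    static double similarity(String pageA, String pageB) {
        Set<String> a = tagPaths(pageA);
        Set<String> b = tagPaths(pageB);
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String product1 = "<html><body><div><h1>A</h1><span>9.99</span></div></body></html>";
        String product2 = "<html><body><div><h1>B</h1><span>19.99</span></div></body></html>";
        String category = "<html><body><ul><li>A</li><li>B</li></ul></body></html>";

        // Same tag structure, different text: the path sets are identical.
        System.out.println(similarity(product1, product2));
        // Only "/html" and "/html/body" are shared: 2 of 7 distinct paths.
        System.out.println(similarity(product1, category));
    }
}
```

Pages whose score is above some threshold could then be put into the same group (product detail, category, and so on). The threshold is something you would have to tune per site, and a set of tag paths is a deliberately crude signature: it ignores how often an element repeats as well as class and id attributes.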