I need a good page rendering library so that I can extract all links (including the anchor text, the underlying hyperlink, and the absolute position of the link on the page) from a web page.
I have been using the CSSBox library; however, the href attribute is missing from the rendered box model. In other words, with CSSBox alone I can only obtain the anchor text and the position of each link. To get the href attribute (i.e., the actual URL), I have to match that anchor text against the output of a separate HTML parsing library (e.g., Jsoup).
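To make the workaround concrete, here is a minimal sketch of the matching step in plain Java. It assumes the renderer and the parser have already produced their results; `RenderedLink`, `ParsedLink`, `FullLink`, and `LinkJoin` are hypothetical names used only for illustration and are not part of CSSBox or Jsoup. Links are paired by whitespace-normalized anchor text, and duplicate texts are paired in document order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: joins renderer output with parser output by anchor text.
class LinkJoin {
    // What the rendering side provides: anchor text and position, but no href.
    static final class RenderedLink {
        final String text; final int x; final int y;
        RenderedLink(String text, int x, int y) { this.text = text; this.x = x; this.y = y; }
    }

    // What the HTML parsing side provides: anchor text and href, but no position.
    static final class ParsedLink {
        final String text; final String href;
        ParsedLink(String text, String href) { this.text = text; this.href = href; }
    }

    // The combined record; href is null when no parsed link matched.
    static final class FullLink {
        final String text; final String href; final int x; final int y;
        FullLink(String text, String href, int x, int y) {
            this.text = text; this.href = href; this.x = x; this.y = y;
        }
    }

    // Collapse runs of whitespace so that layout differences do not break the match.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    static List<FullLink> join(List<RenderedLink> rendered, List<ParsedLink> parsed) {
        // Queue the hrefs for each anchor text in document order.
        Map<String, Deque<String>> hrefsByText = new HashMap<>();
        for (ParsedLink p : parsed) {
            hrefsByText.computeIfAbsent(normalize(p.text), k -> new ArrayDeque<>()).add(p.href);
        }
        List<FullLink> out = new ArrayList<>();
        for (RenderedLink r : rendered) {
            Deque<String> queue = hrefsByText.get(normalize(r.text));
            // poll() returns null on an empty queue, which marks an unmatched link.
            String href = (queue == null) ? null : queue.poll();
            out.add(new FullLink(r.text, href, r.x, r.y));
        }
        return out;
    }
}
```

The weakness is visible in the sketch: the pairing is only as reliable as the anchor text, so links with identical or empty text (image links, for example) can be matched to the wrong href if the two libraries disagree on ordering.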
Is there any library that can better achieve my goal?
Recommendation
Consider using Geb, a Groovy browser-automation library built on top of Selenium WebDriver.
Requirements
As mentioned, this is only suitable if you are open to the use of Groovy. However, since Groovy integrates so easily with Java, this typically isn’t a problem.
Furthermore, Geb drives a real browser through WebDriver, so one must be available wherever this runs. I’m not sure if this is a deal breaker for you.
Usage
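According to the Geb docs, the location of a piece of content is exposed through its `x` and `y` properties and its size through `height` and `width`, all in pixels.
For example, iterating over `$("a")` gives access to each link's `text()`, its `@href` attribute, and its `x`/`y` position.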