I am using jsoup to scrape different html pages:
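    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class HtmlParse {
        public static void main(String[] args) throws IOException {
            String site = args[0];                      // URL passed on the command line
            Document doc = Jsoup.connect(site).get();   // fetch and parse the page
            String htm = doc.body().text();             // visible text of the body
            System.out.println(htm);
        }
    }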
It works beautifully. However, the output seems to include a lot of fluff (e.g., website links [a href]). Is there a quick way to omit this in jsoup? I found the getElementsByTag literature but am having a hard time using it.
Thank you in advance.
You can “clean” the parsed Document; see example.
For example, to leave only simple text:
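A minimal sketch of that idea, assuming jsoup 1.14.1 or later, where the allow-list class is called Safelist (older releases call it Whitelist). The class name CleanToSimpleText, the helper toSimpleText, and the sample HTML are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanToSimpleText {
    // Hypothetical helper: strips every tag that Safelist.simpleText() does not allow.
    // simpleText() keeps only b, em, i, strong and u; the text inside dropped tags stays.
    static String toSimpleText(String html) {
        return Jsoup.clean(html, Safelist.simpleText());
    }

    public static void main(String[] args) {
        // Sample input, not taken from the question
        String html = "<p>Hello <a href=\"http://example.com\">link</a> <b>world</b></p>";
        System.out.println(toSimpleText(html));
    }
}
```

With a Document fetched as in the question, you could pass doc.body().html() to the helper. If you want no markup at all, Safelist.none() should leave only the text nodes.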
Or you can simply delete all a tags:
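Something along these lines should work; it is only a sketch, and the class name RemoveLinks, the helper textWithoutLinks, and the sample HTML are invented for the example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RemoveLinks {
    // Hypothetical helper: parses the HTML, removes every <a> element
    // (including the link text inside it), and returns the remaining body text.
    static String textWithoutLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample input, not taken from the question
        String html = "<p>Hello <a href=\"http://example.com\">link</a> world</p>";
        System.out.println(textWithoutLinks(html));
    }
}
```

Note that remove() also drops the link text itself. If you want to keep the text and lose only the tag, calling unwrap() on the selected elements instead of remove() is the usual alternative.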