I was under the impression that the most costly method in Jsoup’s API is parse().
But I just discovered that Document.html() could be even slower.
Given that the Document is the output of parse() (i.e. this is after parsing), I find this surprising.
Why is Document.html() so slow?
Answering myself. The Element.html() method creates a StringBuilder, passes it to an overloaded Element.html(StringBuilder), and returns the result of StringBuilder.toString() followed by String.trim().
Using StringBuilder instead of String is already a good thing, and the use of StringBuilder.toString() and String.trim() probably does not explain the slowness of Document.html(), even for a relatively large document.
But in the middle, the method calls the overloaded version, Element.html(StringBuilder), which loops through all child nodes in the document. Thus if the document contains lots of child nodes, it will be slow.
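To make the cost easier to see, here is a minimal, self-contained model of the structure described above. It is only a sketch: SketchNode, SketchText and SketchElement are hypothetical stand-ins, not Jsoup's classes, and the real implementation may differ in detail. The point is that html() has to visit every node below the element each time it is called.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a DOM node; NOT Jsoup's actual class. */
abstract class SketchNode {
    /** Appends this node's serialized form to the shared buffer. */
    abstract void outerHtml(StringBuilder accum);
}

/** Hypothetical text node: serializes to its text as-is. */
class SketchText extends SketchNode {
    private final String text;

    SketchText(String text) { this.text = text; }

    @Override
    void outerHtml(StringBuilder accum) { accum.append(text); }
}

/** Hypothetical element node with a list of children. */
class SketchElement extends SketchNode {
    private final String tag;
    private final List<SketchNode> childNodes = new ArrayList<>();

    SketchElement(String tag) { this.tag = tag; }

    SketchElement append(SketchNode child) {
        childNodes.add(child);
        return this;
    }

    /** Public entry point: one buffer, one toString(), one trim(). */
    String html() {
        StringBuilder accum = new StringBuilder();
        html(accum);
        return accum.toString().trim();
    }

    /** Overload that does the real work: loops over every child node. */
    private void html(StringBuilder accum) {
        for (SketchNode node : childNodes) {
            node.outerHtml(accum);
        }
    }

    @Override
    void outerHtml(StringBuilder accum) {
        accum.append('<').append(tag).append('>');
        html(accum); // recursion: each child element serializes its own subtree
        accum.append("</").append(tag).append('>');
    }
}
```

In this model the StringBuilder, toString() and trim() calls happen once per html() call, while the loop body runs once per node in the subtree, so the node count dominates for large documents.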
It would be interesting to see whether there could be a faster implementation of this.
For example, Jsoup could store a cached copy of the raw HTML that was provided to it via Jsoup.parse().
This would have to be optional, of course, to maintain backward compatibility and keep the memory footprint small.
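Until something like that exists in the library, the same idea can be approximated on the caller's side. The sketch below is an assumption-laden illustration, not a Jsoup feature: CachedParse is a hypothetical helper that keeps the raw input string next to whatever the parser returns, so the original markup can be read back without serializing the tree.

```java
import java.util.function.Function;

/**
 * Hypothetical caller-side wrapper (not part of Jsoup): stores the raw
 * HTML string alongside the parsed result so it can be returned cheaply.
 */
final class CachedParse<D> {
    private final String rawHtml;
    private final D document;

    private CachedParse(String rawHtml, D document) {
        this.rawHtml = rawHtml;
        this.document = document;
    }

    /** Parses once with the supplied parser and remembers the input. */
    static <D> CachedParse<D> parse(String rawHtml, Function<String, D> parser) {
        return new CachedParse<>(rawHtml, parser.apply(rawHtml));
    }

    /** Returns the original input; no tree traversal involved. */
    String rawHtml() { return rawHtml; }

    /** Returns the parsed result for normal DOM work. */
    D document() { return document; }
}
```

With Jsoup on the classpath this could be used as CachedParse.parse(html, h -> Jsoup.parse(h)). Two caveats apply: the cached string costs extra memory, and it is the input as given, so it will not match Document.html() once the parser has normalized the markup or the document has been modified.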