Jsoup has two HTML `parse()` methods:
- `parse(String html)` – “As no base URI is specified, absolute URL detection relies on the HTML including a `<base href>` tag.”
- `parse(String html, String baseUri)` – “The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a `<base href>` tag.”
I am having difficulty understanding the difference between the two:
- In the second `parse()` version, what does “resolve relative URLs to absolute URLs, that occur before the HTML declares a `<base href>` tag” mean? What if a `<base href>` tag never occurs in the page?
- What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
- Lastly, but most importantly: is `baseUri` the full URL of the HTML page (as phrased in the original documentation), or is it the base URL of the HTML page?
The `baseUri` is used by, among others, `Element#absUrl()`, so that you can retrieve the (intended) absolute URL of an `<a href>`, `<img src>`, `<link href>`, `<script src>`, etc. This is very useful if you want to download and/or parse the linked resources as well.
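As a minimal sketch of how `Element#absUrl()` relies on the `baseUri`, the snippet below parses a small HTML fragment and prints the absolute form of each link. The HTML string, the `example.com` URL, and the `AbsUrlExample` class with its `absoluteLinks` helper are made up for illustration; only `Jsoup.parse(String, String)`, `select()`, and `absUrl()` are Jsoup's own API. It assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class AbsUrlExample {

    // Hypothetical helper: collects the absolute form of every <a href>
    // in the HTML, resolved against baseUri (the URL the page came from).
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up page content and page URL, for illustration only.
        String html = "<a href=\"/about\">About</a> <a href=\"news.html\">News</a>";
        String baseUri = "http://example.com/docs/index.html";

        for (String url : absoluteLinks(html, baseUri)) {
            System.out.println(url);
        }
    }
}
```

With the inputs above, this should print `http://example.com/about` and `http://example.com/docs/news.html`. If the same HTML is parsed with `parse(String html)` instead, there is no base to resolve against, and `absUrl("href")` should return an empty string for these relative links.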
Some (poor) websites may have declared a `<link>` or `<script>` with a relative URL before the `<base>` tag. And if there is no `<base>` tag at all, then the given `baseUri` is simply used for resolving the relative URLs of the entire document.

Absolute URL detection exists in order to return the right URL from `Element#absUrl()`. This is purely for the end user's convenience. Jsoup doesn't need it in order to successfully parse the HTML on its own.

As to the last question: the former. If it were the latter, the documentation would be lying. The `baseUri` must not be confused with `<base href>`.
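To see why passing the full page URL works, it may help to look at how relative references are resolved in general. The sketch below uses the JDK's `java.net.URI` as a stand-in; it is not Jsoup's own code, and Jsoup's resolution may differ in edge cases. The `BaseUriResolution` class, its `resolve` helper, and the `example.com` URLs are made up for illustration.

```java
import java.net.URI;

class BaseUriResolution {

    // Hypothetical helper: resolves a relative reference against a base URI
    // using java.net.URI from the JDK (not Jsoup).
    static String resolve(String baseUri, String relative) {
        return URI.create(baseUri).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Full URL of the (made-up) page: the last path segment is
        // replaced by the relative reference during resolution.
        String pageUrl = "http://example.com/docs/index.html";
        System.out.println(resolve(pageUrl, "news.html"));
        System.out.println(resolve(pageUrl, "/about"));
        System.out.println(resolve(pageUrl, "../img/logo.png"));

        // A directory-like URL without a trailing slash: "docs" is treated
        // as the last segment and is replaced as well.
        System.out.println(resolve("http://example.com/docs", "news.html"));
    }
}
```

The first three calls yield `http://example.com/docs/news.html`, `http://example.com/about`, and `http://example.com/img/logo.png`. The last call yields `http://example.com/news.html`, which shows why a hand-trimmed "base" URL can resolve links to the wrong place, while the full page URL resolves them as a browser would.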