I’m learning Jsoup and have this HTML:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
</p>
[...]
I use Jsoup.parse() and document.select("p") to extract the content, and that works fine. But with this HTML:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
<p style="..."></p>
<p style="..."></p>
</p>
[...]
In this case, I see that Jsoup.parse() converts the code to:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
</p>
<p style="..."> <!-- div 4 -->
</p>
<p style="..."> <!-- div 5 -->
</p>
[...]
How can I keep the nesting of the paragraphs with Jsoup (div 4 and 5 inside div 3)?
Adding an example:
HTML file:
<html>
<head>
<title>Title</title>
</head>
<body>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
</p>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
</p>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
<p style="margin-left:2em"></p>
<p style="margin-left:2em"></p>
</p>
</body>
</html>
Java code:
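import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// URL_with_HTML is the address of the HTML file shown above
Document doc = Jsoup.connect(URL_with_HTML).get();
System.out.println(doc.outerHtml());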
Output:
<html>
<head>
<title>Title</title>
</head>
<body>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"></p>
<p style="margin-left:2em"></p>
<p></p>
</body>
</html>
Is this correct? I'm using Jsoup 1.6.1. I expected Jsoup to return nested paragraphs instead of the output above.
Nested paragraphs do not exist in HTML. The prior paragraph is closed automatically since Jsoup implements the WHATWG HTML5 specification:
A p tag is automatically closed by any of the following: address, article, aside, blockquote, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table, or ul. Therefore <p><div></div> becomes <p></p><div></div>.
An end tag p (i.e. </p>) that does not have a corresponding start tag is a parse error and is treated as if a <p> start tag came right before it. Therefore <span></span></p> becomes <span></span><p></p>, which is where the empty <p></p> at the end of your output comes from.
So Jsoup is correct and your HTML is invalid.
Note that your HTML is invalid because it has a surplus </p>, not because of the "nested" paragraphs. Nesting cannot happen, because an open paragraph is auto-closed as soon as the next <p> starts. The trailing </p> is then stray, because its "corresponding" <p> was already auto-closed earlier.
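To see the two rules in action without involving a full parser, here is a minimal sketch in plain Java. It is a toy, not Jsoup's actual implementation: the class name ParagraphAutoClose and the method normalize are made up for illustration, and it only looks at <p> and </p> tags (it ignores div, table and the other tags that also close a paragraph).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration of the two <p> rules described above; NOT Jsoup's real parser. */
class ParagraphAutoClose {

    // Matches a <p ...> start tag or a </p> end tag; <pre>, <span> etc. do not match.
    private static final Pattern P_TAG =
            Pattern.compile("<(/?)p(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Rewrites the input so that every <p> is closed before the next one opens. */
    static String normalize(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = P_TAG.matcher(html);
        boolean open = false; // is a paragraph currently open?
        int last = 0;         // end of the previously handled tag
        while (m.find()) {
            out.append(html, last, m.start()); // copy everything between tags unchanged
            last = m.end();
            boolean isEndTag = !m.group(1).isEmpty();
            if (!isEndTag) {
                if (open) {
                    out.append("</p>"); // rule 1: a new <p> auto-closes the open one
                }
                out.append(m.group());
                open = true;
            } else if (open) {
                out.append("</p>");     // ordinary, matching end tag
                open = false;
            } else {
                out.append("<p></p>");  // rule 2: stray </p> yields an empty paragraph
            }
        }
        out.append(html, last, html.length());
        if (open) {
            out.append("</p>"); // close a paragraph left open at end of input
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same shape as the HTML in the question: two <p> "inside" a third one.
        System.out.println(normalize("<p>one</p><p>three<p></p><p></p></p>"));
        // prints: <p>one</p><p>three</p><p></p><p></p><p></p>
    }
}
```

The output has the same shape as what Jsoup printed in the question: the paragraph with content is closed early, the two "inner" paragraphs become siblings, and the leftover </p> turns into an extra empty <p></p>. If you really need block-level nesting, the container would have to be an element that may contain paragraphs (for example a div), since a p can never contain another p.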