I am trying to solve the following problem.
Assume I have an HTML file that reads:
<div class = nameCouldBeAnything1><br>
<p>some text here</p><br>
</div>
<div class = nameCouldBeAnything2><br>
<p>some more text here</p><br>
</div>
<div class = nameCouldBeAnything3><br>
<p>even more text here</p><br>
<p>and here</p><br>
<p>and here</p><br>
<p>and here</p><br>
<p>and here</p><br>
</div>
What I am trying to achieve is to store the contents of each div in its own string or string array variable.
A Jsoup solution would be ideal; failing that, a regex that matches from the opening p tag to the closing /p tag would also work.
The challenges to take into consideration are:
1) The div class names are not known in advance, so they cannot be used to pinpoint the p tags and obtain the plain text with Jsoup.
2) Using doc.select("body p") or doc.select("div p") from Jsoup partly works, but it returns all p tags as one flat list, so each p ends up in its own variable instead of being grouped by its parent div.
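For the regex fallback mentioned above, a minimal sketch using only java.util.regex might look like the following. It assumes the divs are not nested and that the markup is as regular as the sample; the names RegexDivParagraphs and paragraphsByDiv are made up for illustration. Regexes tend to be brittle on real-world HTML, so this is only a sketch under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not from the post.
class RegexDivParagraphs {
    // One non-nested <div ...>...</div> block; group 1 is the inner HTML.
    private static final Pattern DIV = Pattern.compile(
            "<div\\b[^>]*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // One <p ...>...</p> element; group 1 is the text between the tags.
    private static final Pattern P = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one list per div, each holding the contents of that div's p tags.
    static List<List<String>> paragraphsByDiv(String html) {
        List<List<String>> result = new ArrayList<>();
        Matcher div = DIV.matcher(html);
        while (div.find()) {
            List<String> paragraphs = new ArrayList<>();
            Matcher p = P.matcher(div.group(1));
            while (p.find()) {
                paragraphs.add(p.group(1).trim());
            }
            result.add(paragraphs);
        }
        return result;
    }
}
```

Because the class name of each div is swallowed by `[^>]*`, no specific class name is needed. Any markup nested inside a p tag would be kept verbatim in the captured string.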
This is what I have so far:
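import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// "input" is assumed to be a java.io.File pointing at the HTML document
Document htmlFile = Jsoup.parse(input, "UTF-8");
Elements body = htmlFile.select("body p");
// Print every p tag in document order; the grouping by parent div is lost here
for (Element p : body)
{
System.out.println(p.text());
}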
This gets each p tag individually; however, I want the p tags to stay within their respective divs, with each div stored in its own string or string array variable.
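One way to keep the p tags grouped is to select the divs first and then read the p children of each div, so no class name is needed. The following is a minimal sketch, assuming the divs are not nested and the p tags are direct children of their div; DivParagraphs and paragraphsByDiv are hypothetical names and not part of Jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the name is an assumption, not from the post.
class DivParagraphs {
    // Returns one list per div, each holding the text of that div's p children.
    static List<List<String>> paragraphsByDiv(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> result = new ArrayList<>();
        for (Element div : doc.select("div")) {
            List<String> paragraphs = new ArrayList<>();
            // Only look at direct children so each p is attributed to one div.
            for (Element child : div.children()) {
                if (child.tagName().equals("p")) {
                    paragraphs.add(child.text());
                }
            }
            result.add(paragraphs);
        }
        return result;
    }
}
```

If a String[] per div is needed, each inner list can be converted with toArray(new String[0]).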
I was able to solve my dilemma.
This is the code I used; hopefully it helps someone in need.
Thanks to everyone who posted.
Note: there may be some syntax errors, since I typed this in manually.