So here’s the scenario. I have a large html file that I would like to scrape using JSoup. I am new to this and I have been going through some tutorials and the API references. I have the following block of html.
<p><a name="bob"></a>
<table class='schedules'>
<tr><td align='center' colspan="5"><b>Bob the Builder</b><br>
<a href="blah blah" class='tiny'>Blah Blah Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><!--<td class='whoohaa'><a href="random/randomUrl.htm">Blah</a></td>--><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td></tr>
</table>
</p>
Now there are many more of these blocks following a similar pattern whereby (in the first line) the name attribute changes (from “bob” to something else). What I would like to do is firstly be able to select the “bob” p block and then retrieve all the html until the terminating p block in the last line.
I have attempted the following:
Elements innerStuff = doc.select("a:contains(bob) ~ *");
But it only gives me links with href atrributes, which I guess is the what will be expected. However, I’m struggling to see how else I can solve this problem?
Your help in this regard is highly appreciated.
A more straitforward way to select the a tag based on its name attribute would be to do:
From there, you should be able to navigate to the elements you want using parent() (to get the p tag containing the link) for example (you’ll need to call first() before to get the first (and only) element matching the selector):
One problem though: the parsed HTML document is different from the original HTML:
Here’s the original HTML structure:
Here’s how the parsed HTML looks like:
So, starting from the link tag, and to get that table’s element, you’ll need to go up one level (to the paragraph) and grab the next element: