I’m implementing a web robot that has to get all the links from a page and select the ones it needs. Everything works, except that I ran into a problem when a link is inside a “table” or a “span” tag.
Here’s my code snippet:
Document doc = Jsoup.connect(url)
.timeout(TIMEOUT * 1000)
.get();
Elements elts = doc.getElementsByTag("a");
And here’s the example HTML:
<table>
<tr><td><a href="www.example.com"></a></td></tr>
</table>
My code does not fetch such links, and using doc.select doesn’t help either. My question is: how do I get all the links from the page?
EDIT: I think I know where the problem is. The page I’m having trouble with is very badly written; the HTML validator reports a tremendous number of errors. Could this cause problems?
In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup sees it (you can simply output doc.toString()) and check whether the links are still present after parsing.
Tip: use select() instead of getElementsByX(); it’s faster and more flexible.
Elements elts = doc.select("a");
Edit: Here’s an overview of the Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
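To make the two suggestions concrete, here is a minimal sketch that parses the example table from the question, dumps the document as Jsoup sees it, and collects the links with select(). The class name LinkDump and the helper extractLinks are hypothetical names chosen for illustration, and the a[href] selector (anchors that actually carry an href) is an assumption about what the robot wants. It parses a string instead of fetching a URL so it can be run offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LinkDump {

    // Hypothetical helper: returns the raw href value of every <a> that has an href attribute.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // The example HTML from the question.
        String html = "<table>\n"
                + "<tr><td><a href=\"www.example.com\"></a></td></tr>\n"
                + "</table>";

        // Dump the document as Jsoup parsed it, to check whether the link survived parsing.
        Document doc = Jsoup.parse(html);
        System.out.println(doc);

        // List the links that select() finds.
        System.out.println(extractLinks(html));
    }
}
```

If the dump of the real page does not contain the anchors at all, the problem lies in what the server returns or in how the broken markup was repaired, not in the selector.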