Document doc = Jsoup.connect("http://www.utah.edu/").get();
Elements lists = doc.select("ul");
for (Element list : lists) {
    // Collect the anchor text of every link inside this list
    Elements li = list.select("li a");
    if (li.size() > 0) {
        ArrayList<String> anchors = new ArrayList<String>();
        for (Element e : li) {
            anchors.add(e.text());
        }
        System.out.println(anchors);
    }
}
I’m trying to grab all HTML lists (the ul elements) from this page, but it fails: nothing is printed. I suspect there’s a script in the page preventing my program from doing so.
Edit: To make my question even simpler, consider the following code:
Document doc = Jsoup.connect("http://www.utah.edu/").get();
Elements lists = doc.select("ul");
System.out.println(lists.size());
Output:
0
A possible answer is that the User-Agent header of the request sent by jsoup made utah.edu think it was talking to a bot instead of a browser, so it returned different page content.
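One way to check this guess is to look at what the server actually returned. The following is only a rough sketch, assuming jsoup is on the classpath; the class name InspectResponse and the helper summarize are hypothetical names introduced here for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class InspectResponse {
    // Summarize a parsed page: its title and how many <ul> elements it contains.
    static String summarize(Document doc) {
        return "title=" + doc.title() + ", ul count=" + doc.select("ul").size();
    }

    public static void main(String[] args) throws Exception {
        // Same request as in the question, with no User-Agent set explicitly.
        Document doc = Jsoup.connect("http://www.utah.edu/").get();
        System.out.println(summarize(doc));
        // Dump the markup jsoup parsed, to compare it with what a browser shows.
        System.out.println(doc.outerHtml());
    }
}
```

If the dumped markup differs clearly from the page source shown in a browser, the server is probably serving different content to this client.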
jsoup’s get(), implemented in org/jsoup/helper/HttpConnection.java, doesn’t send a User-Agent header by default unless told otherwise, so you need to set it manually with userAgent(). Example, faking Chrome:
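A minimal sketch of setting the header with userAgent(), assuming jsoup is on the classpath. The User-Agent string below is only an illustrative Chrome-like value chosen for this sketch, and the class name FetchWithUserAgent and helper browserLike are hypothetical names.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class FetchWithUserAgent {
    // Illustrative Chrome-like value (an assumption); any current browser string should do.
    static final String CHROME_UA =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Build a connection that identifies itself as a browser; nothing is sent yet.
    static Connection browserLike(String url) {
        return Jsoup.connect(url).userAgent(CHROME_UA);
    }

    public static void main(String[] args) throws Exception {
        // The request is only executed when get() is called.
        Document doc = browserLike("http://www.utah.edu/").get();
        Elements lists = doc.select("ul");
        // Number of <ul> elements in the page returned for this User-Agent.
        System.out.println(lists.size());
    }
}
```

Whether this changes the result depends on how the site decides what to serve, so it is worth comparing the count with and without the userAgent() call.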