I’m using JSoup to parse all the links on a webpage, and I then test the response code of each gathered link.
The issue I’m having is that some of the pages I’m testing have links that open a JavaScript popup using: . I’m sure there’s a simple way to avoid selecting these links, but I can’t think of it anymore!
My code:
    PingUrls(String pageUrl) {
        url = pageUrl;
        int i = 0;
        int retries = 3;
        // Try the connection up to `retries` times before giving up.
        while (i < retries){
            try {
                response = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                        .timeout(10000)
                        .execute();
                success = true;
                break;
            } catch (IOException e) {
                // Ignored on purpose: fall through and retry.
            }
            System.out.println("Attempt "+i);
            i++;
        }
    }

    public int getUrlStatus(){
        if(success){
            int statusCode = response.statusCode();
            return statusCode;
        }else {
            // Reported when every attempt failed.
            return 404;
        }
    }

    public ArrayList<String> getLinks(String targetValue){
        ArrayList<String> urls = new ArrayList<String>();
        try {
            Document doc = response.parse();
            // Select every anchor with an href inside the target element.
            Elements element = doc.select(targetValue+" a[href]");
            for (Element page : element){
                urls.add(page.attr("abs:href"));
            }
            return urls;
        } catch (IOException e) {
            System.out.println(e);
            return null;
        }
    }
First of all, I’d advise using a Set instead of a List. (If you’re not familiar with Collections, a Set guarantees that there are no repeated elements.)
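To illustrate the Set suggestion, here is a minimal sketch. The class name LinkDedup, the method uniqueLinks, and the example.com URLs are made up for the example; a LinkedHashSet is just one possible choice, picked here because it also keeps the links in the order they were found.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of the original code.
class LinkDedup {

    // Copies the links into a LinkedHashSet: repeated entries are dropped
    // and the first-seen order is preserved.
    static Set<String> uniqueLinks(List<String> links) {
        return new LinkedHashSet<>(links);
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
                "http://example.com/a",
                "http://example.com/b",
                "http://example.com/a"); // duplicate of the first entry
        // prints [http://example.com/a, http://example.com/b]
        System.out.println(uniqueLinks(found));
    }
}
```

In getLinks this would roughly mean declaring urls as a Set<String> and changing the return type to match, so the same link is never pinged twice.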
Also, I’d call a method like manageURL(String url) before you add the URL to the Collection. Put some tests in it to make sure it crawls the way you want, such as checking the URL’s absolute path and canonical path, and making sure it uses the http or https protocol.
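One possible shape for such a check is sketched below, using only java.net.URI. The class name UrlFilter, the helper filterLinks, and the sample URLs are assumptions made for the example, and the rule it applies (absolute URL, http or https scheme, non-empty host) is only one reasonable reading of the tests described above. Canonicalization is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, not part of the original code.
class UrlFilter {

    // Returns true only for absolute URLs that use http or https and have a host.
    // Anything else (javascript:, mailto:, relative paths, empty strings,
    // syntactically invalid input) is rejected.
    static boolean manageURL(String url) {
        if (url == null || url.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI, so not worth pinging
        }
    }

    // Keeps only the candidates accepted by manageURL, without duplicates.
    static Set<String> filterLinks(Iterable<String> candidates) {
        Set<String> urls = new LinkedHashSet<>();
        for (String candidate : candidates) {
            if (manageURL(candidate)) {
                urls.add(candidate.trim());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample input: only the http and https entries should survive.
        System.out.println(filterLinks(Arrays.asList(
                "http://example.com/page",
                "javascript:void(0)",
                "mailto:someone@example.com",
                "",
                "https://example.com/other")));
    }
}
```

Inside the loop in getLinks, the call would wrap the add, for example: if (manageURL(href)) urls.add(href); where href is the value read from abs:href. Links that fail the check are skipped, so they never get pinged.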