I'm making a simple program to scrape content from several webpages. To speed it up, I want to use threads, and I want to control the number of threads with an integer (down the line I want users to be able to set this).
This is the code I want to create threads for:
public void runLocales(String langLocale) {
    ParseXML parser = new ParseXML(langLocale);
    int statusCode = parser.getSitemapStatus();
    if (statusCode > 0) {
        for (String page : parser.getUrls()) {
            urlList.append(page + "\n");
        }
    } else {
        urlList.append("Connection timed out");
    }
}
And the ParseXML class:
import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseXML {
    private String sitemapPath;
    private String sitemapName = "sitemap.xml";
    private String sitemapDomain = "somesite";
    Connection.Response response = null;
    boolean success = false;

    ParseXML(String langLocale) {
        sitemapPath = sitemapDomain + "/" + langLocale + "/" + sitemapName;
        int i = 0;
        int retries = 3;
        // Try the request up to three times before giving up.
        while (i < retries) {
            try {
                response = Jsoup.connect(sitemapPath)
                        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                        .timeout(10000)
                        .execute();
                success = true;
                break;
            } catch (IOException e) {
                // Request failed; fall through and retry.
            }
            i++;
        }
    }

    // Returns the HTTP status code, or 0 if every attempt failed.
    public int getSitemapStatus() {
        if (success) {
            return response.statusCode();
        } else {
            return 0;
        }
    }

    // Returns the text of every <loc> element in the sitemap.
    public ArrayList<String> getUrls() {
        ArrayList<String> urls = new ArrayList<String>();
        try {
            Document doc = response.parse();
            Elements element = doc.select("loc");
            for (Element page : element) {
                urls.add(page.text());
            }
        } catch (IOException e) {
            System.out.println(e);
            // Fall through with an empty list instead of null so callers can iterate safely.
        }
        return urls;
    }
}
I've been reading up on threads for a few days now and I can't figure out how to implement threading in my case. Can someone offer some insight, please?
Something like this should do:
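A minimal sketch of the one-thread-per-locale idea, under some assumptions: the class name LocaleThreads, the results list, and the body of runLocales are hypothetical stand-ins. The real runLocales would build a ParseXML and append its URLs; here it only records the locale so the threading part can be read on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class LocaleThreads {
    // Thread-safe list standing in for urlList from the question (assumption).
    public static final List<String> results =
            Collections.synchronizedList(new ArrayList<String>());

    // Hypothetical stand-in for runLocales(langLocale) from the question.
    public static void runLocales(String langLocale) {
        results.add("processed " + langLocale);
    }

    // Starts one thread per locale, then waits for all of them to finish.
    public static void runAll(List<String> locales) {
        List<Thread> threads = new ArrayList<>();
        for (final String locale : locales) {
            Thread t = new Thread(() -> runLocales(locale));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        runAll(Arrays.asList("en-us", "de-de", "fr-fr"));
        System.out.println(results.size() + " locales processed");
    }
}
```

Note that this starts as many threads as there are locales, with no upper bound.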
Obviously, you still need to add the code to control how many Threads you want to create, etc., and decide what you want to do if your threshold is reached.
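One possible way to cap the thread count is a fixed-size pool from java.util.concurrent. The sketch below is an illustration under assumptions, not the only option: LocalePool, results, the one-minute wait, and the stand-in runLocales body are all hypothetical. With a fixed pool, tasks beyond the limit simply wait in the pool's queue until a worker is free, which is one answer to what happens when the threshold is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LocalePool {
    // Thread-safe list standing in for urlList from the question (assumption).
    public static final List<String> results =
            Collections.synchronizedList(new ArrayList<String>());

    // Hypothetical stand-in for runLocales(langLocale) from the question.
    public static void runLocales(String langLocale) {
        results.add("processed " + langLocale);
    }

    // Runs every locale on a pool of at most maxThreads worker threads.
    // Returns true if all tasks finished within the assumed one-minute limit.
    public static boolean runAll(List<String> locales, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (final String locale : locales) {
            pool.execute(() -> runLocales(locale));
        }
        pool.shutdown(); // accept no new tasks; already queued ones still run
        try {
            return pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // The second argument is the user-defined thread limit.
        boolean done = runAll(Arrays.asList("en-us", "de-de", "fr-fr", "es-es"), 2);
        System.out.println("finished=" + done + ", results=" + results.size());
    }
}
```

If runLocales updates a Swing component such as a text area, those updates may need to be handed to the event dispatch thread (for example with SwingUtilities.invokeLater) rather than made directly from the workers.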