I am building a web crawler using Java EE technologies. I have created a crawler service that holds the crawl results as CrawlerElement objects, each containing the information I am interested in.
Currently I am using the jsoup library for this, but it is not reliable: even though I attempt the connection three times with a timeout of 10 seconds, it still fails.
By unreliable I mean that a page which is publicly accessible sometimes cannot be accessed by the crawler program. I know this could be due to a robots.txt exclusion, but even pages that robots.txt allows are fetched unreliably.
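For reference, the "three attempts" retry described above can be sketched with the standard library alone. This is only an illustration: the class name `RetryingFetch`, the method `withRetries`, and the linear backoff are assumptions, not part of the original crawler, and the actual fetch (jsoup or otherwise) would be passed in as the task.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryingFetch {

    /**
     * Runs the task up to maxAttempts times. Only IOException is treated as a
     * retryable failure; any other exception propagates immediately.
     * If every attempt fails, the last IOException is rethrown.
     */
    public static <T> T withRetries(Callable<T> task, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Linear backoff (an assumption): wait a little longer after each failure
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }
}
```

Pausing between attempts may help when the failures come from a server that throttles rapid repeated requests, though that is only one possible cause of the behaviour described.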
So I decided to go with URLConnection instead: I call openConnection() on the URL and then connect() on the returned connection.
I have one more requirement that is bugging me: I have to get the response time in milliseconds for a CrawlerElement, that is, how long it took to load page B from page A. I checked the methods of URLConnection and found nothing that does this.
Any ideas on this? Can anyone help me?
I was thinking of taking the current time in milliseconds before and after the code that gets the content, subtracting the two, and saving the difference in the database, but I am not sure whether that would be accurate.
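The before-and-after timing idea could look roughly like the sketch below. It uses System.nanoTime(), which is intended for measuring elapsed time, rather than System.currentTimeMillis(), which follows the wall clock and can jump if the system time is adjusted. The names `ElapsedTimer`, `Timed`, and `time` are made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class ElapsedTimer {

    /** Holds the value returned by the timed task and how long the task took. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the task once and measures the elapsed time around it. */
    public static <T> Timed<T> time(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        T value = task.call();
        long elapsedNanos = System.nanoTime() - start;
        return new Timed<>(value, TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }
}
```

The number is only as meaningful as what the task covers: if the task opens the connection but never reads the response, the measurement excludes the server's processing and the transfer time. Wrapping the whole fetch, including reading the body, gives a figure closer to "how long the page took to load", although it still includes DNS lookup and network latency on the crawler's side.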
Thanks in advance.
EDIT : CURRENT IMPLEMENTATION
Current implementation, which gives me statusCode, contentType, etc.:
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class GetContent {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.javacoffeebreak.com/faq/faq0079.html");
        long startTime = System.currentTimeMillis();
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("Authorization", "Basic bG9hbnNkZXY6bG9AbnNkM3Y=");
        uc.setRequestProperty("User-Agent", "");
        // connect() only opens the connection; the response is not read yet
        uc.connect();
        long endTime = System.currentTimeMillis();
        // Time spent opening the connection, in milliseconds
        System.out.println(endTime - startTime);
        String contentType = uc.getContentType();
        System.out.println(contentType);
        // For HTTP, header field 0 is typically the status line, e.g. "HTTP/1.1 200 OK"
        String statusCode = uc.getHeaderField(0);
        System.out.println(statusCode);
    }
}
What do you say: is it okay to do it this way, or should I use a heavier API such as Apache HttpClient or Apache Nutch?
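Before reaching for a heavier library, one variation that stays within the JDK is to cast the connection to HttpURLConnection, which exposes the numeric status code through getResponseCode() and lets you set connect and read timeouts. The sketch below also reads the body to the end before stopping the clock, so the elapsed time covers the full response rather than just opening the connection. The class `TimedFetch`, its `Result` type, and the 8 KB buffer are illustrative assumptions, and the cast only works for http and https URLs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimedFetch {

    /** Outcome of one fetch: status, content type, body size, and total time. */
    public static final class Result {
        public final int statusCode;
        public final String contentType;
        public final long bytesRead;
        public final long elapsedMillis;

        Result(int statusCode, String contentType, long bytesRead, long elapsedMillis) {
            this.statusCode = statusCode;
            this.contentType = contentType;
            this.bytesRead = bytesRead;
            this.elapsedMillis = elapsedMillis;
        }
    }

    public static Result fetch(URL url, int timeoutMillis) throws IOException {
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            // getResponseCode() sends the request and waits for the status line
            int status = conn.getResponseCode();
            String contentType = conn.getContentType();
            // Error responses (4xx/5xx) expose their body via getErrorStream()
            InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            long bytes = 0;
            if (body != null) {
                try (InputStream in = body) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        bytes += n;
                    }
                }
            }
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            return new Result(status, contentType, bytes, elapsedMillis);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `TimedFetch.fetch(url, 10_000)` would mirror the 10-second timeout mentioned earlier; the returned `elapsedMillis` is the value that could be stored for the CrawlerElement.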
OK, so you have already done the work and are running into problems with that API/library. I know it is painful to build something and then throw away all that code and move to another library, but consider it if that is possible for you. jsoup is just a parser library and may cause you more problems in the future, so I suggest using one of these more stable APIs. You can also use crawler4j for this purpose. Here is the list of some open source crawler APIs; with a little R&D you can find a good solution for this 🙂