I am creating a web crawler using Java EE Technologies. I have created a

Question

Asked: June 9, 2026 (2026-06-09T17:18:48+00:00)

I am creating a web crawler using Java EE technologies. I have created a crawler service that holds the results of the web crawl as CrawlerElement objects, which contain the information I am interested in.

Currently I am using the jsoup library for this, but it is not reliable. I attempt the connection three times with a timeout of 10 seconds, and it is still unreliable.

By unreliable I mean that even when a page is publicly accessible, the crawler program cannot access it. I know this could be caused by a robots.txt exclusion, but crawling is allowed there and it is still unreliable.

So I decided to go with a URLConnection object, obtained from URL.openConnection(), and then call its connect() method.

I have one more requirement that is bugging me: I have to record the response time in milliseconds for each CrawlerElement, that is, how long it took to load page B from page A. I checked the methods of URLConnection and found no way to get this.

Any ideas on this? Can anyone help me?

I was thinking of taking the current time in milliseconds before and after the code that gets the content, subtracting the two values, and saving the difference in the database, but I was wondering whether that would be accurate.

Thanks in advance.

EDIT: CURRENT IMPLEMENTATION

The current implementation, which gives me the status code, content type, etc.:

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class GetContent {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.javacoffeebreak.com/faq/faq0079.html");

        // Start the clock just before the connection is opened.
        long startTime = System.currentTimeMillis();
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("Authorization", "Basic bG9hbnNkZXY6bG9AbnNkM3Y=");
        uc.setRequestProperty("User-Agent", "");
        uc.connect();
        long endTime = System.currentTimeMillis();

        // Elapsed time covers openConnection() and connect() only;
        // the response body is never read in this program.
        System.out.println(endTime - startTime);

        String contentType = uc.getContentType();
        System.out.println(contentType);

        // For HTTP URLs, header field 0 is typically the status line.
        String statusCode = uc.getHeaderField(0);
        System.out.println(statusCode);
    }
}

What do you say: is it okay to do it this way, or should I use a heavier API such as Apache HttpClient or Apache Nutch?

1 Answer

Editorial Team · Answer 1 · 2026-06-09T17:18:49+00:00

OK, so you have already done the work and are running into problems with that API/library. I know it is daunting to build something and then throw away all that code to switch to another library, but if that is possible for you, I suggest doing so: jsoup is just a parser library and may cause you more problems in the future, so I recommend one of these more stable APIs. You can also use crawler4j for this purpose.
Here is the list of some open source crawler APIs; by doing some R&D you can find a good solution for this 🙂
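
On the response-time requirement from the question: the before/after timing idea can be sketched roughly as below. This is only a minimal sketch and not a definitive implementation. The class name TimedFetch, its Result holder, and the fetch method are hypothetical names invented for illustration. It assumes that "response time" means the wall-clock time from opening the connection until the whole body has been read, and it uses System.nanoTime() rather than System.currentTimeMillis() because nanoTime is intended for measuring elapsed time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper: times one request from openConnection() to end of body.
final class TimedFetch {

    /** Holds what one timed request produced. */
    static final class Result {
        final int statusCode;     // -1 when the URL is not HTTP(S)
        final long bytesRead;     // size of the body that was read
        final long elapsedMillis; // wall-clock time for the whole fetch

        Result(int statusCode, long bytesRead, long elapsedMillis) {
            this.statusCode = statusCode;
            this.bytesRead = bytesRead;
            this.elapsedMillis = elapsedMillis;
        }
    }

    static Result fetch(URL url, int timeoutMillis) throws IOException {
        long start = System.nanoTime();

        URLConnection uc = url.openConnection();
        uc.setConnectTimeout(timeoutMillis);
        uc.setReadTimeout(timeoutMillis);

        // getResponseCode() sends the request and waits for the status line.
        int status = -1;
        if (uc instanceof HttpURLConnection) {
            status = ((HttpURLConnection) uc).getResponseCode();
        }

        // Read the body to the end so the download is part of the timing.
        // For HTTP error statuses (4xx/5xx) getInputStream() throws IOException.
        long bytes = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = uc.getInputStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes += n;
            }
        }

        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(status, bytes, elapsedMillis);
    }
}
```

A call such as TimedFetch.fetch(new URL("http://www.javacoffeebreak.com/faq/faq0079.html"), 10000) would then give a Result whose elapsedMillis can be stored with the CrawlerElement. The figure includes DNS lookup, connection setup, and network latency, so it will vary between runs; averaging several measurements is likely more meaningful than a single one.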
