I am currently working on updating a project for university. The program in question visits IP addresses and determines whether each IP hosts a website. The goal of the system is to determine the size of the web (the work is distributed across the available systems; the last run took 2.5 months).
The current goal is to decrease the time it takes to make an accurate decision for an IP, but I am lost as to how to improve this. Currently, the following is the main test (with additional logic around it, of course):
Socket s = new Socket();
s.connect(new InetSocketAddress(address, 80), timeout);
What I am mainly asking is whether there is any faster method to determine if an IP hosts a website while remaining accurate. The current system uses a timeout of 30 seconds, so a large number of IP address checks take the full 30 seconds, as many IPs do not host a website. Any help pointing towards a Java library or a paper on an algorithm would be greatly appreciated.
Thanks.
The only reliable way to determine whether a host is willing to serve you a web page on a given port is to request it, which always means opening a TCP socket and sending an HTTP GET request. However, you can use techniques (and C libraries) from Nmap http://nmap.org/ to efficiently detect whether there is a TCP endpoint at :80. Of course, you can tune your program to check a couple of thousand hosts at the same time, per public IP …
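As a rough illustration of the two ideas above (request the page, and check many hosts at once), here is a minimal Java sketch. It is not Nmap's technique: Nmap can use raw SYN probes, which the standard Java library does not expose, so this falls back to ordinary connects spread over a thread pool. The names HttpProbe, speaksHttp and probeAll are hypothetical, and the timeout and thread count are values you would have to tune yourself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class, for illustration only.
public class HttpProbe {

    // Returns true if the host accepts a TCP connection on the port and
    // answers a GET request with something that looks like an HTTP status line.
    static boolean speaksHttp(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            s.setSoTimeout(timeoutMillis); // also bound the wait for the reply

            String request = "GET / HTTP/1.1\r\n"
                    + "Host: " + host + "\r\n"
                    + "Connection: close\r\n\r\n";
            OutputStream out = s.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
            String statusLine = in.readLine();
            return statusLine != null && statusLine.startsWith("HTTP/");
        } catch (IOException e) {
            // Refused, timed out, reset, or unreachable: treated as "no website".
            return false;
        }
    }

    // Probes all hosts concurrently with a fixed-size thread pool and
    // returns the result per host, in the order the hosts were given.
    static Map<String, Boolean> probeAll(List<String> hosts, int port,
                                         int timeoutMillis, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (String h : hosts) {
                pending.put(h, pool.submit(() -> speaksHttp(h, port, timeoutMillis)));
            }
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Boolean>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as HttpProbe.probeAll(hosts, 80, 3000, 200) would then check the whole list with up to 200 connections in flight. With blocking sockets the thread count is the practical limit on parallelism; getting to thousands of simultaneous probes would probably need non-blocking I/O instead.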
Note, however, that your entire approach can only give a very rough count of web servers on port 80, nothing more. There are other ports, encryption (SSL), and multiple websites per HTTP host that distort your measurements. And don’t forget that there is both IPv4 and IPv6.