I want to use Java to retrieve text from a website. I can easily get the page source with the following code (thank you to the random internet person who posted it elsewhere):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class WebCrawler {
    public static void main(String[] args) {
        try {
            URL google = new URL("http://stackoverflow.com");
            URLConnection yc = google.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
However, this leaves me with the problem that some sites return a 403. Is there a way around this?
Very simply, I was hoping to use Java to create a simple bot that would scan a forum thread and automatically respond based on user queries. Am I able to do this in Java, or do I need to approach it with another language or data retrieval method?
Thank you for your time.
Yes, this can be done in Java. In theory, anything a web browser can do, Java can do – since, in the very worst case, you could write a web browser in Java.
A 403 is a “forbidden” response. You may need to set a particular User-Agent header, or the site might require HTTP basic authentication. Or perhaps it’s rate-limiting you and you need to see about obeying their robots.txt rules…
Java is certainly not (in my opinion) the easiest language in which to write this type of code, but you’re on a decent track here.
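To make the User-Agent and basic authentication suggestions concrete, here is a minimal sketch using only the standard library. The class name PoliteFetcher, the User-Agent string, and the commented-out credentials are all made up for illustration; whether any of this gets past a particular site's 403 depends on why that site is refusing the request.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PoliteFetcher {
    // Hypothetical User-Agent value; use one that honestly identifies your bot.
    static final String USER_AGENT = "ExampleForumBot/0.1 (+http://example.com/bot-info)";

    // Prepares (but does not yet send) a request that carries a User-Agent header.
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Builds the value of an HTTP basic authentication "Authorization" header.
    static String basicAuthHeader(String user, String password) {
        String pair = user + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = open("https://stackoverflow.com");
        // Only needed if the site asks for basic authentication (placeholder credentials):
        // conn.setRequestProperty("Authorization", basicAuthHeader("user", "secret"));

        // Check the status first so a 403 is reported instead of thrown as an exception.
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            System.err.println("Request failed with HTTP status " + status);
            return;
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Checking getResponseCode() before reading the body lets you tell a 403 apart from other failures, which helps when you are working out which of the causes above applies.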
As for your “not source” in the title – the source of a web page is its text. If you download the page, you’re going to get HTML; it’s up to you to parse out what you need and discard the dross.
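As a rough illustration of “parsing out what you need”, the sketch below pulls the text content out of an HTML string with the HTML parser that ships with the JDK (in the javax.swing packages). The class name HtmlText and the sample markup are invented for the example, and this parser is old and lenient, so a dedicated HTML parsing library may well serve a real forum bot better.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlText {
    // Returns the text found between tags, one chunk per line.
    static String extractText(String html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once for each run of text the parser encounters.
                text.append(data).append('\n');
            }
        };
        // The final argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample markup standing in for a downloaded page.
        String html = "<html><body><h1>Thread title</h1><p>First post</p></body></html>";
        System.out.print(extractText(html));
    }
}
```

In practice you would feed extractText the page you downloaded, then search the resulting text for the user queries your bot should answer.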