I have a webcrawler framework and a set of independent modules that implement it. Each of these modules captures news from a news-specific website.
The framework expects two unpredictable errors: IOException and SocketTimeoutException, for obvious reasons (a website may be offline and/or under maintenance).
The problem: on one specific website (THIS one) I get random IOExceptions all the time. I tried to predict when it happens, but I still don’t know why I’m getting this error.
I figured it was from bombarding the site with requests during the test phase. It is not, since after 2 or 3 days without sending another request it still throws the error.
In a nutshell: the site does not require authentication, and it randomly returns 403. RANDOMLY.
Since a 403 can stand for several different errors, I’d like to see what the specific problem with my application is.
If I could find out which 403 it is (403.1, 403.2, …, 403.n), I could try to work around it.
//If you guys want the code, it's a basic Jsoup get.
//(I have also tried it with native API,
//and still get the same random 403 errors)
//Note that I also tried it with no redirection, and still get the error
Document doc = Jsoup
.connect("http://www.agoramt.com.br/")
.timeout(60000)
.followRedirects(true)
.get();
//You may criticize the code, but this specific call is the one
//that throws the error, and it doesn't randomly do that for the other 3k
//site connections. That's why I want to get the specifics of the 403.
A server may return a 403 on a whim. You are not expected to resolve this on your end, other than by respecting the server’s wish not to let you in. You may try to read the response body for details provided by the server, but that’s probably all you’ll get. The 403.n error codes you are looking for are, I believe, an IIS-specific feature, and the site you pointed to seems to be served by nginx, so don’t expect to get those.
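To see whatever detail the server does send along with the 403, here is a minimal sketch using the JDK's own HttpURLConnection (the "native API" mentioned in the question). The class name ForbiddenDetails and the helper describe are made up for illustration. It assumes Java 9+ for readAllBytes and a UTF-8 body. With Jsoup, I believe ignoreHttpErrors(true) combined with execute() gives you the same information through the Response object instead of an exception.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the asker's framework.
public class ForbiddenDetails {

    // Returns the status, the Server header and the body, whatever the status code is.
    static String describe(HttpURLConnection conn) throws IOException {
        int status = conn.getResponseCode();
        // For 4xx/5xx responses the body is on the error stream, not the input stream.
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        String body = "";
        if (in != null) {
            try (InputStream s = in) {
                // Assumes the body is UTF-8; a real crawler should honor Content-Type.
                body = new String(s.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
        return status + " " + conn.getResponseMessage()
                + "\nServer: " + conn.getHeaderField("Server")
                + "\n\n" + body;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.agoramt.com.br/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(60000);
        conn.setReadTimeout(60000);
        System.out.println(describe(conn));
    }
}
```

Logging this output every time the crawler gets a 403 is probably the closest you will get to "which 403 it is" on a non-IIS server.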
If your webcrawler randomly gets a 403 but a regular web browser (from the same IP) never does, the best I can suggest is to make your webcrawler’s request headers look exactly like what a regular web browser would send. Whether that is proper behavior for a polite webcrawler is a different discussion.
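A rough sketch of the browser-like headers idea, again with HttpURLConnection. The class name BrowserLikeRequest and the header values are illustrative assumptions, not something the site is known to check; copy the real values from your own browser's network inspector. With Jsoup the equivalent calls should be userAgent(...) and header(name, value) on the connection.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of the asker's framework.
public class BrowserLikeRequest {

    // Example values only; replace them with what your browser actually sends.
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0");
        headers.put("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "pt-BR,pt;q=0.9,en;q=0.8");
        // Accept-Encoding is left out on purpose: HttpURLConnection does not
        // decompress gzip bodies for you.
        return headers;
    }

    // Opens (but does not yet send) a GET request carrying the headers above.
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        browserHeaders().forEach(conn::setRequestProperty);
        conn.setConnectTimeout(60000);
        conn.setReadTimeout(60000);
        conn.setInstanceFollowRedirects(true);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = open("http://www.agoramt.com.br/");
        // The request is sent here, when the response code is first read.
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}
```

If the 403s stop once the headers match, the server is most likely filtering on one of them. Removing them one at a time should show which.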