I am crawling a page that requires a username and password for authentication. When I pass my username and password in the code, I successfully get a 200 OK response back from the server for that page. But the crawler stops as soon as it gets the 200 OK response. It doesn't move on into the page after authentication to crawl the links on that page. The crawler is taken from http://code.google.com/p/crawler4j/.
This is the code where I do the authentication:
public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    List<String> exclusions;

    public MyCrawler() {
        exclusions = new ArrayList<String>();
        //Add here all your exclusions
        exclusions.add("http://www.dot.ca.gov/dist11/d11tmc/sdmap/cameras/cameras.html");
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        DefaultHttpClient client = null;
        try {
            System.out.println("----------------------------------------");
            System.out.println("WEB URL:- " + url);
            client = new DefaultHttpClient();
            client.getCredentialsProvider().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
                    new UsernamePasswordCredentials("test", "test"));
            client.getParams().setParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);
            for (String exclusion : exclusions) {
                if (href.startsWith(exclusion)) {
                    return false;
                }
            }
            if (href.startsWith("http://") || href.startsWith("https://")) {
                return true;
            }
            HttpGet request = new HttpGet(url.toString());
            System.out.println("----------------------------------------");
            System.out.println("executing request" + request.getRequestLine());
            HttpResponse response = client.execute(request);
            HttpEntity entity = response.getEntity();
            System.out.println(response.getStatusLine());
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    public void visit(Page page) {
        System.out.println("hello");
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        System.out.println("Page:- " + url);
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();
        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
    }
}
And this is my Controller class:
public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        //And I want to crawl all those links that are there in this password protected page
        controller.addSeed("http://search.somehost.com/");
        controller.start(MyCrawler.class, 20);
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
    }
}
Am I doing anything wrong?
As described at http://code.google.com/p/crawler4j/, the shouldVisit() function should only return true or false. In your code, this function also fetches the content of the page, which is wrong. The current version of crawler4j (3.0) doesn't support crawling password-protected pages.
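To illustrate the first point, here is a minimal sketch of a shouldVisit() that only decides and never fetches. To keep it runnable without the crawler4j jar, it takes a plain String instead of crawler4j's WebURL; the class name UrlFilterSketch is hypothetical. It also applies the file-extension filter that the question declares but never uses, which is an assumption about what was intended. It does not solve the authentication problem itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical standalone class: mirrors the decision logic only,
// with a String parameter standing in for crawler4j's WebURL.
public class UrlFilterSketch {

    // Same extension filter as in the question (declared there, but unused).
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Same exclusion list as in the question.
    private static final List<String> EXCLUSIONS = Arrays.asList(
            "http://www.dot.ca.gov/dist11/d11tmc/sdmap/cameras/cameras.html");

    // Pure yes/no decision: no HTTP client, no network access, no side effects.
    public static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        if (FILTERS.matcher(href).matches()) {
            return false; // skip binary and static resources
        }
        for (String exclusion : EXCLUSIONS) {
            if (href.startsWith(exclusion)) {
                return false; // explicitly excluded page
            }
        }
        return href.startsWith("http://") || href.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://search.somehost.com/"));         // prints true
        System.out.println(shouldVisit("http://search.somehost.com/logo.png")); // prints false
    }
}
```

In a real crawler the same body would sit inside MyCrawler.shouldVisit(WebURL url), reading the address with url.getURL() as the question's code already does.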