I am crawling a page that requires username and password for authentication. And I


Asked: May 27, 2026


I am crawling a page that requires a username and password for authentication. When I pass my username and password in the code, I successfully get a 200 OK response back from the server for that page. But the crawler stops as soon as it gets the 200 OK response. It doesn't move on into the page after authentication to crawl the links on that page. The crawler is taken from http://code.google.com/p/crawler4j/.
This is the code where I do the authentication:

public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    List<String> exclusions;

    public MyCrawler() {
        exclusions = new ArrayList<String>();
        // Add here all your exclusions
        exclusions.add("http://www.dot.ca.gov/dist11/d11tmc/sdmap/cameras/cameras.html");
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        DefaultHttpClient client = null;

        try {
            System.out.println("----------------------------------------");
            System.out.println("WEB URL:- " + url);

            client = new DefaultHttpClient();

            client.getCredentialsProvider().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
                    new UsernamePasswordCredentials("test", "test"));
            client.getParams().setParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);

            for (String exclusion : exclusions) {
                if (href.startsWith(exclusion)) {
                    return false;
                }
            }

            if (href.startsWith("http://") || href.startsWith("https://")) {
                return true;
            }

            HttpGet request = new HttpGet(url.toString());

            System.out.println("----------------------------------------");
            System.out.println("executing request" + request.getRequestLine());
            HttpResponse response = client.execute(request);
            HttpEntity entity = response.getEntity();

            System.out.println(response.getStatusLine());
        } catch (Exception e) {
            e.printStackTrace();
        }

        return false;
    }

    public void visit(Page page) {
        System.out.println("hello");
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        System.out.println("Page:- " + url);
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
    }
}

And this is my Controller class:

public class Controller {
    public static void main(String[] args) throws Exception {

        CrawlController controller = new CrawlController("/data/crawl/root");

        // I want to crawl all the links on this password-protected page
        controller.addSeed("http://search.somehost.com/");

        controller.start(MyCrawler.class, 20);
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
    }
}

Am I doing anything wrong?


1 Answer

Editorial Team · Answer 1 · 2026-05-27T21:14:28+00:00


As described at http://code.google.com/p/crawler4j/, the shouldVisit() function should only return true or false. In your code, this function also fetches the content of the page, which is wrong. The current version of crawler4j (3.0) doesn't support crawling password-protected pages.
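To illustrate the first point, here is a minimal sketch of a filter that only decides true or false and never opens a connection. It is a standalone illustration rather than crawler4j code: the class name UrlFilterSketch is hypothetical, and the method takes a plain String instead of a WebURL so that it runs without the library. It reuses the extension pattern and the exclusion list from the question; whether you want the extension filter applied is an assumption, since the question declares it but never uses it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical standalone class; in a real crawler this logic would live
// inside shouldVisit(WebURL url) and start from url.getURL().
class UrlFilterSketch {

    // Same file extensions as the "filters" pattern in the question.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Same exclusion as in the question's constructor.
    private static final List<String> EXCLUSIONS = Arrays.asList(
            "http://www.dot.ca.gov/dist11/d11tmc/sdmap/cameras/cameras.html");

    // Pure decision: no HTTP client, no request, no side effects.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);

        // Skip static and binary resources.
        if (FILTERS.matcher(href).matches()) {
            return false;
        }
        // Skip explicitly excluded pages.
        for (String exclusion : EXCLUSIONS) {
            if (href.startsWith(exclusion)) {
                return false;
            }
        }
        // Follow only http and https links.
        return href.startsWith("http://") || href.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://search.somehost.com/"));          // prints true
        System.out.println(shouldVisit("http://search.somehost.com/logo.png"));  // prints false
    }
}
```

Fetching is left entirely to the crawler, so this sketch does not make the password-protected page crawlable; it only shows the shape shouldVisit() is expected to have.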
