This is the code taken from http://code.google.com/p/crawler4j/ and the name of this file is

Question

Asked: May 23, 2026 (2026-05-23T21:17:47+00:00)

This is the code taken from http://code.google.com/p/crawler4j/ and the name of this file is MyCrawler.java


public class MyCrawler extends WebCrawler {

        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        /*
         * You should implement this function to specify
         * whether the given URL should be visited or not.
         */
        public boolean shouldVisit(WebURL url) {
                String href = url.getURL().toLowerCase();
                if (filters.matcher(href).matches()) {
                        return false;
                }
                if (href.startsWith("http://www.xyz.us.edu/")) {
                        return true;
                }
                return false;
        }

        /*
         * This function is called when a page is fetched
         * and ready to be processed by your program
         */
        public void visit(Page page) {
                int docid = page.getWebURL().getDocid();
                String url = page.getWebURL().getURL();         
                String text = page.getText();
                List<WebURL> links = page.getURLs();            
        }
}

And this is the code for Controller.java, from which MyCrawler is called:

public class Controller {
        public static void main(String[] args) throws Exception {
                CrawlController controller = new CrawlController("/data/crawl/root");
                controller.addSeed("http://www.xyz.us.edu/");
                controller.start(MyCrawler.class, 10);  
        }
}

So I just want to make sure I understand what this line in Controller.java means:

controller.start(MyCrawler.class, 10);

What is the meaning of 10 here? And if we increase it from 10 to 20, what will the effect be? Any suggestions will be appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T21:17:48+00:00

This website shows the source for CrawlController.

Increasing the value from 10 to 20 increases the number of crawlers (each running in its own thread); studying that code will tell you what effect this will have.
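To make the "one crawler per thread" idea concrete, here is a minimal sketch that does not use crawler4j at all. It only mimics the general pattern of N worker threads draining one shared queue of URLs. The class name CrawlerThreadsSketch, the run method, and the page URLs are made up for illustration; the real CrawlController may schedule its work quite differently.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustration only: this is NOT crawler4j code. It mimics the idea of
// "numberOfCrawlers" workers, each in its own thread, sharing one URL queue.
class CrawlerThreadsSketch {

    // Starts numberOfCrawlers threads and returns how many URLs each one handled.
    static Map<String, Integer> run(int numberOfCrawlers, int pages) {
        Queue<String> frontier = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < pages; i++) {
            frontier.add("http://www.xyz.us.edu/page" + i); // made-up URLs
        }

        Map<String, Integer> processedByThread = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[numberOfCrawlers];
        for (int i = 0; i < numberOfCrawlers; i++) {
            workers[i] = new Thread(() -> {
                // Each worker keeps taking URLs until the shared queue is empty.
                while (frontier.poll() != null) {
                    processedByThread.merge(Thread.currentThread().getName(), 1, Integer::sum);
                }
            }, "crawler-" + i);
            workers[i].start();
        }

        // Wait for every worker to finish before reporting the counts.
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return processedByThread;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 20}) {
            Map<String, Integer> counts = run(n, 1000);
            int total = counts.values().stream().mapToInt(Integer::intValue).sum();
            System.out.println(n + " threads: " + counts.size()
                    + " did work, " + total + " pages in total");
        }
    }
}
```

The total number of pages handled is the same with 10 or 20 threads; only the way the work is split changes. In a real crawl each thread spends most of its time waiting on the network, so doubling the thread count will typically raise throughput only until bandwidth, CPU, or the load the target server tolerates becomes the limit.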
