This is the below code in my MyCrawler.java and it is crawling all those

Question

Asked: May 23, 2026

Below is the code in my MyCrawler.java. It crawls every link that matches one of the prefixes I pass to href.startsWith. Suppose I do not want to crawl one particular page, http://inv.somehost.com/people/index.html. How can I do this in my code?

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {

        String href = url.getURL().toLowerCase();

        if (href.startsWith("http://www.somehost.com/")
                || href.startsWith("http://inv.somehost.com/")
                || href.startsWith("http://jo.somehost.com/")) {
            // And if I do not want to crawl this page, http://inv.somehost.com/data/index.html, how can it be done?
            return true;
        }
        return false;
    }


    public void visit(Page page) {

        int docid = page.getWebURL().getDocid();

        String url = page.getWebURL().getURL();         
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        try {
            URL url1 = new URL(url);
            System.out.println("URL:- " +url1);
            URLConnection connection = url1.openConnection();

            Map responseMap = connection.getHeaderFields();
            Iterator iterator = responseMap.entrySet().iterator();
            while (iterator.hasNext())
            {
                String key = iterator.next().toString();

                if (key.contains("text/html") || key.contains("text/xhtml"))
                {
                    System.out.println(key);
                    // Content-Type=[text/html; charset=ISO-8859-1]
                    if (filters.matcher(key) != null){
                        System.out.println(url1);
                        try {
                            final File parentDir = new File("crawl_html");
                            parentDir.mkdir();
                            final String hash = MD5Util.md5Hex(url1.toString());
                            final String fileName = hash + ".txt";
                            final File file = new File(parentDir, fileName);
                            // Creates crawl_html/<hash>.txt if it does not already exist
                            boolean success = file.createNewFile();

                            System.out.println("hash:-" + hash);
                            System.out.println(file);

                            // Open the file in append mode
                            FileOutputStream fos = new FileOutputStream(file, true);
                            PrintWriter out = new PrintWriter(fos);

                            // Extract the page text with Tika and write it to the file
                            Tika t = new Tika();
                            String content = t.parseToString(new URL(url1.toString()));

                            out.println("===============================================================");
                            out.println(url1);
                            out.println(key);
                            //out.println(success);
                            out.println(content);
                            out.println("===============================================================");

                            // Closing the PrintWriter flushes it and closes fos as well
                            out.close();

                        } catch (FileNotFoundException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (IOException e) {
                            // TODO Auto-generated catch block

                            e.printStackTrace();
                        } catch (TikaException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        }


                        // http://google.com
                    }
                }


            }



        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }



        System.out.println("=============");
    }

And this is my Controller.java, from which MyCrawler is called:

public class Controller {
    public static void main(String[] args) throws Exception {
            CrawlController controller = new CrawlController("/data/crawl/root");
            controller.addSeed("http://www.somehost.com/");
            controller.addSeed("http://inv.somehost.com/");
            controller.addSeed("http://jo.somehost.com/");
            controller.start(MyCrawler.class, 20);  
            controller.setPolitenessDelay(200);
            controller.setMaximumCrawlDepth(2);
    }
}

Any suggestions will be appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T19:31:01+00:00

How about adding a property that lists the URLs you want to exclude?

Add every page that you don't want crawled to this exclusions list.

Here is an example:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // URLs matching any of these patterns are never visited.
    private final List<Pattern> exclusionsPatterns;

    public MyCrawler() {
        exclusionsPatterns = new ArrayList<Pattern>();
        // Add all your exclusions here, as regular expressions.
        exclusionsPatterns.add(Pattern.compile("http://investor\\.somehost\\.com.*"));
    }

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        // Check every pattern to see whether the URL is excluded.
        for (Pattern exclusionPattern : exclusionsPatterns) {
            Matcher matcher = exclusionPattern.matcher(href);
            if (matcher.matches()) {
                return false;
            }
        }

        if (href.startsWith("http://www.ics.uci.edu/")) {
            return true;
        }
        return false;
    }
}

In this example, we specify that no URL starting with http://investor.somehost.com should be crawled.

So these won't be crawled:

http://investor.somehost.com/index.html
http://investor.somehost.com/something/else
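
The matching behavior can be checked on its own, without crawler4j. The sketch below reuses the expression from the example; the class name ExclusionPatternDemo and the helper isExcluded are made up for illustration. Note that Matcher.matches() succeeds only when the whole string matches, which is why the expression ends in .* to cover the rest of the URL.

```java
import java.util.regex.Pattern;

public class ExclusionPatternDemo {

    // Same expression as in the example: the dots in the host name are
    // escaped, and the trailing .* accepts anything after the host.
    static final Pattern EXCLUDED =
            Pattern.compile("http://investor\\.somehost\\.com.*");

    // Hypothetical helper: true when the entire URL matches the pattern.
    public static boolean isExcluded(String href) {
        return EXCLUDED.matcher(href).matches();
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("http://investor.somehost.com/index.html"));     // true
        System.out.println(isExcluded("http://investor.somehost.com/something/else")); // true
        System.out.println(isExcluded("http://www.somehost.com/index.html"));          // false
    }
}
```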

I recommend reading up on regular expressions.
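
If only the single page from the question needs to be skipped, an exact comparison may be enough and no regular expression is required. The following is a minimal sketch under a few assumptions: it takes a plain String rather than a WebURL so that it runs without crawler4j, it lowercases the URL the same way the question's shouldVisit does, and the names SinglePageFilter and EXCLUDED_PAGES are illustrative only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SinglePageFilter {

    // Exact URLs (lowercase) that should never be crawled.
    static final Set<String> EXCLUDED_PAGES = new HashSet<>(Arrays.asList(
            "http://inv.somehost.com/people/index.html"));

    // Stand-in for shouldVisit(WebURL): same prefix checks as in the
    // question, preceded by an exact-match exclusion check.
    public static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        if (EXCLUDED_PAGES.contains(href)) {
            return false;
        }
        return href.startsWith("http://www.somehost.com/")
                || href.startsWith("http://inv.somehost.com/")
                || href.startsWith("http://jo.somehost.com/");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://inv.somehost.com/people/index.html")); // false
        System.out.println(shouldVisit("http://inv.somehost.com/people/other.html")); // true
        System.out.println(shouldVisit("http://example.org/"));                       // false
    }
}
```

In the real crawler the same check would go at the top of shouldVisit(WebURL url), applied to url.getURL().toLowerCase().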
