I want to crawl only HTML pages so when I changed the regular expression

Question

Asked: May 23, 2026

I want to crawl only HTML pages, so I changed the regular expression in the code below. However, it is still crawling some XML pages as well. Any suggestions as to why this is happening?

public class MyCrawler extends WebCrawler {


    Pattern filters = Pattern.compile("(.(html))");

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://www.somehost.com/")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();

        String url = page.getWebURL().getURL();         
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }   
}

1 Answer

Editorial Team · Answer 1 · 2026-05-23T13:58:50+00:00

The file extension tells you very little on the web, especially with newer “SEO”-type paths that often have no extension at all. You have to look at the response's Content-Type instead.
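Separately from that point, the pattern in the question cannot filter anything as written. `Matcher.matches()` succeeds only when the whole input matches, and `(.(html))` describes exactly five characters, so no full URL ever matches it. The question's `shouldVisit` also returns false on a match, which would exclude HTML pages rather than keep them. The following is a minimal sketch of an extension check that behaves as intended; the class name `HtmlUrlFilter`, the method `looksLikeHtml`, and the replacement pattern are illustrative assumptions, not part of the original code.

```java
import java.util.regex.Pattern;

public class HtmlUrlFilter {
    // The pattern from the question. matches() requires the WHOLE input to
    // match, so this only accepts five-character strings such as "xhtml".
    static final Pattern ORIGINAL = Pattern.compile("(.(html))");

    // Hypothetical replacement: any URL whose path ends in .htm or .html,
    // optionally followed by a query string or fragment.
    static final Pattern HTML_EXTENSION = Pattern.compile(".*\\.html?([?#].*)?$");

    // Returns true when the URL looks like an HTML page by extension alone.
    static boolean looksLikeHtml(String url) {
        return HTML_EXTENSION.matcher(url.toLowerCase()).matches();
    }

    public static void main(String[] args) {
        String page = "http://www.somehost.com/index.html";
        String feed = "http://www.somehost.com/feed.xml";
        System.out.println(ORIGINAL.matcher(page).matches()); // false
        System.out.println(looksLikeHtml(page));              // true
        System.out.println(looksLikeHtml(feed));              // false
    }
}
```

In `shouldVisit`, a check like this would be combined with the host prefix test and would return true on a match. It still misses HTML served from extensionless URLs, which is why the Content-Type is the more reliable signal.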

You can do this by requesting each URL (with the HTTP GET or possibly HEAD method) and analyzing its response headers. If the Content-Type response header is not what you want, you don’t have to download the body; otherwise, it is a page you want to look at.

Edit: HTML should have text/html as its Content-Type; XHTML is application/xhtml+xml (but note that the latter may be subject to content negotiation, which usually depends on the Accept header and the user agent in the request).
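As a rough illustration of the header check described above, here is a minimal sketch using only the JDK's `HttpURLConnection`. The class name `ContentTypeCheck`, both method names, and the five-second timeouts are assumptions made for the example; a real crawler library may already expose the response's content type, in which case that should be preferred over a second request.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

public class ContentTypeCheck {
    // Strips parameters such as "; charset=UTF-8" and compares the media type
    // against the two values named in the answer.
    static boolean isHtmlContentType(String contentType) {
        if (contentType == null) {
            return false; // header missing, or the request failed
        }
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
    }

    // Sends a HEAD request so that only the headers are transferred.
    static String fetchContentType(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000); // assumed timeout values
        conn.setReadTimeout(5000);
        try {
            return conn.getContentType(); // null when no Content-Type is available
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder host taken from the question; pass a real URL as an argument.
        String address = args.length > 0 ? args[0] : "http://www.somehost.com/";
        System.out.println(isHtmlContentType(fetchContentType(address)));
    }
}
```

Some servers answer HEAD differently from GET or reject it outright, so falling back to GET and inspecting the headers before reading the body may be necessary.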

You can find all the information about the HTTP headers here.
