Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6105513
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 23, 20262026-05-23T13:58:49+00:00 2026-05-23T13:58:49+00:00

I want to crawl onyl html pages so when I changed the regular expression

  • 0

I want to crawl onyl html pages so when I changed the regular expression here in this code.. it is still crawling some xml page also.. Any suggestions why is it happening..

public class MyCrawler extends WebCrawler {


    Pattern filters = Pattern.compile("(.(html))");

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://www.somehost.com/")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();

        String url = page.getWebURL().getURL();         
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }   
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-23T13:58:50+00:00Added an answer on May 23, 2026 at 1:58 pm

    The extension is meaningless on the web – especially with newer “SEO”-type paths. You have to analyze it’s content-type.

    You can do this by requesting (with the HTTP GET or possibly HEAD method) each URL and analyze its response headers. If the Content-Type response header is not what you want, you don’t have to download it – otherwise it’s what you want to look at.

    Edit: HTML should have text/html as content-type, XHTML is application/xhtml+xml (but note that the latter may be subject to content-negotiation, which is usually dependent on the content of your accept header and the user agent in the request).

    You can find all the information about the HTTP headers here.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I want to crawl and save some webpages as HTML. Say, crawl into hundreds
I want to crawl some data out of a phpBB forum i'm a member
This is the below code in my MyCrawler.java and it is crawling all those
Here's my issue. I have this function to test proxy servers. function crawl() {
This website showing forex rates of different countries, and i want to crawl all
I want to use java.net.url to crawl some websites and retrieve some data. I
I'm using WWW::Mechanize::Firefox to crawl pages that load some JavaScript after they have been
I want to crawl a website and store the content on my computer for
i have one domain link text i want to know that does google crawl
Want to code a key pad for an calculator. What I want to make

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.