Editorial Team
Asked: May 24, 2026 (2026-05-24T00:49:02+00:00)

I am writing a crawler/parser that should be able to process different types of

I am writing a crawler/parser that should be able to process different types of content: RSS, Atom, and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.

Unfortunately, checking the content type using the method provided by URLConnection doesn’t always work. For example,

String contentType = url.openConnection().getContentType();

doesn’t always return the correct content type (e.g. “text/html” where it should be RSS), or doesn’t make it possible to distinguish between RSS and Atom (e.g. “application/xml” could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The problem is that I am having trouble coming up with an elegant class design in which I need to download the InputStream only once. In my current design, I first wrote a separate class that determines the correct content type. Next, the ParseFactory uses this information to create an instance of the corresponding parser, which in turn downloads the entire InputStream a second time when its ‘parse()’ method is called.

public Parser createParser(){

    InputStream inputStream = null;
    String contentType = null;
    String contentEncoding = null;

    ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
    Parser parser = null;

    try {

        inputStream = new BufferedInputStream(this.url.openStream());
        contentTypeParser.parse(inputStream);
        contentType = contentTypeParser.getContentType();
        contentEncoding = contentTypeParser.getContentEncoding();

        assert (contentType != null);

        // Close the first stream before opening a second one for the parser
        inputStream.close();
        inputStream = new BufferedInputStream(this.url.openStream());

        if (contentType.equals(ContentTypes.rss))
        {
            logger.info("RSS feed detected");
            // parse() is called once for all parser types further down
            parser = new RssParser(this.url);
        }
        else if (contentType.equals(ContentTypes.atom))
        {
            logger.info("Atom feed detected");
            parser = new AtomParser(this.url);
        }
        else if (contentType.equals(ContentTypes.html))
        {
            logger.info("html detected");
            parser = new HtmlParser(this.url);
            parser.setContentEncoding(contentEncoding);
        }
        else if (contentType.equals(ContentTypes.UNKNOWN))
            logger.debug("Unable to recognize content type");

        if (parser != null)
            parser.parse(inputStream);

    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            // openStream() may have thrown before inputStream was assigned
            if (inputStream != null)
                inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    return parser;

}

Basically, I am looking for a solution that allows me to eliminate the second “inputStream = new BufferedInputStream(this.url.openStream())”.

Any help would be greatly appreciated!

Side note 1: Just for the sake of being complete, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but this returns null way too often.

Side note 2: The XML-parsers (Atom and Rss) are based on SAXParser, the Html-parser on Jsoup.


1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 12:49 am

    Can you just call mark and reset?

    inputStream = new BufferedInputStream(this.url.openStream());
    // Read-ahead limit: must be at least as many bytes as the detection step reads
    inputStream.mark(2048); // Or some other sensible number
    
    contentTypeParser.parse(inputStream);
    contentType = contentTypeParser.getContentType();
    contentEncoding = contentTypeParser.getContentEncoding();
    
    inputStream.reset(); // Let the parser have a crack at it now
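
To make the idea concrete, here is a minimal, self-contained sketch of the mark/reset approach. It is not the asker's ContentTypeParser: the class name MarkResetSketch, the constant SNIFF_LIMIT, and the substring checks for "<rss", "<feed" and "<html" are all assumptions made for illustration, and the detection is only a rough heuristic (RSS 1.0 feeds rooted in "<rdf:RDF", for instance, would come back as "unknown"). It also assumes Java 9 or later for readAllBytes().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class MarkResetSketch {

    // Hypothetical read-ahead limit; the detection step never reads past it.
    public static final int SNIFF_LIMIT = 2048;

    /** Peeks at the start of the stream, then rewinds it so a parser can re-read it. */
    public static String sniffContentType(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int total = 0;
        int n;
        // read() may return fewer bytes than requested, so loop until the limit or EOF
        while (total < head.length
                && (n = in.read(head, total, head.length - total)) != -1) {
            total += n;
        }
        in.reset(); // rewind to the marked position

        // ISO-8859-1 maps every byte to a char, which is enough to spot ASCII tag names
        String text = new String(head, 0, total, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        if (text.contains("<rss")) return "rss";
        if (text.contains("<feed")) return "atom";
        if (text.contains("<html")) return "html";
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        // An in-memory document stands in for url.openStream() in this sketch
        byte[] doc = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));

        String type = sniffContentType(in);
        byte[] again = in.readAllBytes(); // the full document is still available
        System.out.println(type + ", " + again.length + " bytes still readable");
    }
}
```

The one thing to watch is the read-ahead limit: if the detection step reads more bytes than were passed to mark(), reset() is allowed to fail with an IOException. If the detector has to look at the whole document, buffering it once into a byte array and wrapping that in a fresh ByteArrayInputStream for each consumer is a different technique that avoids the limit altogether.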
    