I am writing a crawler/parser that should be able to process different types of


Asked: May 24, 2026 (2026-05-24T00:49:02+00:00)


I am writing a crawler/parser that should be able to process different types of content: RSS, Atom, and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.

Unfortunately, checking the content type with the method provided by URLConnection doesn’t always work. For example,

String contentType = url.openConnection().getContentType();

doesn’t always provide the correct content type (e.g. “text/html” where it should be RSS), or doesn’t make it possible to distinguish between RSS and Atom (e.g. “application/xml” could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The problem is that I am having trouble coming up with an elegant class design in which I need to download the InputStream only once. In my current design, I first wrote a separate class that determines the correct content type. The ParseFactory then uses this information to create an instance of the corresponding parser, which in turn downloads the entire InputStream a second time when its ‘parse()’ method is called.

public Parser createParser(){

    InputStream inputStream = null;
    String contentType = null;
    String contentEncoding = null;

    ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
    Parser parser = null;

    try {

        inputStream = new BufferedInputStream(this.url.openStream());
        contentTypeParser.parse(inputStream);
        contentType = contentTypeParser.getContentType();
        contentEncoding = contentTypeParser.getContentEncoding();

        assert (contentType != null);

        inputStream = new BufferedInputStream(this.url.openStream());

        if (contentType.equals(ContentTypes.rss))
        {
            logger.info("RSS feed detected");
            parser = new RssParser(this.url);
            // parse(inputStream) is called once below, for every parser type
        }
        else if (contentType.equals(ContentTypes.atom))
        {
            logger.info("Atom feed detected");
            parser = new AtomParser(this.url);
        }
        else if (contentType.equals(ContentTypes.html))
        {
            logger.info("html detected");
            parser = new HtmlParser(this.url);
            parser.setContentEncoding(contentEncoding);
        }
        else if (contentType.equals(ContentTypes.UNKNOWN))
            logger.debug("Unable to recognize content type");

        if (parser != null)
            parser.parse(inputStream);

    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // openStream() may have thrown before inputStream was assigned
        if (inputStream != null) {
            try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    return parser;

}

Basically, I am looking for a solution that allows me to eliminate the second “inputStream = new BufferedInputStream(this.url.openStream())”.

Any help would be greatly appreciated!

Side note 1: Just for the sake of being complete, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but this returns null way too often.
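To illustrate side note 1, here is a small sketch of how URLConnection.guessContentTypeFromStream can be exercised on in-memory data. The class name GuessDemo and the sample documents are made up for illustration. As far as I know, the method only inspects a few leading bytes and requires a stream that supports mark/reset, so two feeds that begin with the same XML declaration should get the same guess, and a feed without a declaration may get no guess at all.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
final class GuessDemo {

    // Returns the JDK's guess for the given content; the result may be null.
    static String guess(String content) throws IOException {
        // BufferedInputStream supports mark/reset, which the method requires.
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
        return URLConnection.guessContentTypeFromStream(in);
    }

    public static void main(String[] args) throws IOException {
        String decl = "<?xml version=\"1.0\"?>";
        // Made-up sample documents for illustration only.
        System.out.println("rss with declaration:  " + guess(decl + "<rss version=\"2.0\"/>"));
        System.out.println("atom with declaration: " + guess(decl + "<feed/>"));
        System.out.println("rss, no declaration:   " + guess("<rss version=\"2.0\"/>"));
        System.out.println("html:                  " + guess("<html><body/></html>"));
    }
}
```

Comparing the first two lines of output shows why this method alone is not enough to pick between RssParser and AtomParser.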

Side note 2: The XML-parsers (Atom and Rss) are based on SAXParser, the Html-parser on Jsoup.


1 Answer

Editorial Team · Answer 1 · 2026-05-24T00:49:03+00:00


Can you just call mark() and reset() on the BufferedInputStream, so that the same stream is read twice without a second download?

inputStream = new BufferedInputStream(this.url.openStream());
// The read limit must cover everything contentTypeParser reads; otherwise reset() can fail
inputStream.mark(2048); // Or some other sensible number

contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();

inputStream.reset(); // Let the parser have a crack at it now
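A self-contained sketch of the same idea, in case it helps to try it in isolation. ContentSniffer, SNIFF_LIMIT and the tag checks are hypothetical stand-ins for the ContentTypeParser from the question, which is not shown; the detection rules are deliberately crude and would need tightening for real feeds. The point is only that the sniffer reads at most as many bytes as were passed to mark(), so reset() can rewind the stream for the real parser.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the original code: sniffs the content
// type from the first bytes of a stream, then rewinds it for the real parser.
final class ContentSniffer {

    // Upper bound on how many bytes are consumed between mark() and reset().
    static final int SNIFF_LIMIT = 2048;

    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int total = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        in.reset(); // rewind: the next reader starts at the marked position

        // ISO-8859-1 maps each byte to one char, enough to spot ASCII tag names.
        String text = new String(head, 0, total, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        // Crude, assumed heuristics: feeds are checked before HTML.
        if (text.contains("<rss")) return "rss";
        if (text.contains("<feed")) return "atom";
        if (text.contains("<html") || text.contains("<!doctype html")) return "html";
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        String rss = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>";
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(rss.getBytes(StandardCharsets.UTF_8)));
        System.out.println(sniff(in));        // detected type
        System.out.println((char) in.read()); // first character again, so the stream was rewound
    }
}
```

In createParser() the stream returned by url.openStream() would take the place of the ByteArrayInputStream, and the rewound stream would be passed straight to parser.parse(inputStream). If the type cannot be decided within the limit, one fallback is to read the whole response into a byte array once and wrap it in a fresh ByteArrayInputStream for each pass.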
