Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 905037
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 15, 20262026-05-15T16:08:57+00:00 2026-05-15T16:08:57+00:00

Searching SO and Google, I’ve found that there are a few Java HTML parsers

  • 0

Searching SO and Google, I’ve found that there are a few Java HTML parsers which are consistently recommended by various parties. Unfortunately it’s hard to find any information on the strengths and weaknesses of the various libraries. I’m hoping that some people have spent some comparing these libraries, and can share what they’ve learned.

Here’s what I’ve seen:

  • JTidy
  • NekoHTML
  • jsoup
  • TagSoup

And if there’s a major parser that I’ve missed, I’d love to hear about its pros and cons as well.

Thanks!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-15T16:08:58+00:00Added an answer on May 15, 2026 at 4:08 pm

    General

    Almost all known HTML parsers implements the W3C DOM API (part of the JAXP API, Java API for XML processing) and gives you a org.w3c.dom.Document back which is ready for direct use by JAXP API. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-wellformed HTML ("tagsoup"), like JTidy, NekoHTML, TagSoup and HtmlCleaner. You usually use this kind of HTML parsers to "tidy" the HTML source (e.g. replacing the HTML-valid <br> by a XML-valid <br />), so that you can traverse it "the usual way" using the W3C DOM and JAXP API.

    The only ones which jumps out are HtmlUnit and Jsoup.

    HtmlUnit

    HtmlUnit provides a completely own API which gives you the possibility to act like a webbrowser programmatically. I.e. enter form values, click elements, invoke JavaScript, etcetera. It’s much more than alone a HTML parser. It’s a real "GUI-less webbrowser" and HTML unit testing tool.

    Jsoup

    Jsoup also provides a completely own API. It gives you the possibility to select elements using jQuery-like CSS selectors and provides a slick API to traverse the HTML DOM tree to get the elements of interest.

    Particularly the traversing of the HTML DOM tree is the major strength of Jsoup. Ones who have worked with org.w3c.dom.Document know what a hell of pain it is to traverse the DOM using the verbose NodeList and Node APIs. True, XPath makes the life easier, but still, it’s another learning curve and it can end up to be still verbose.

    Here’s an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since without it, the code needed to gather the information of interest would otherwise grow up 10 times as big, without writing utility/helper methods).

    String url = "http://stackoverflow.com/questions/3152138";
    Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
    XPath xpath = XPathFactory.newInstance().newXPath();
      
    Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
    System.out.println("Question: " + question.getFirstChild().getNodeValue());
    
    NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
    for (int i = 0; i < answerers.getLength(); i++) {
        System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
    }
    

    And here’s an example how to do exactly the same with Jsoup:

    String url = "http://stackoverflow.com/questions/3152138";
    Document document = Jsoup.connect(url).get();
    
    Element question = document.select("#question .post-text p").first();
    System.out.println("Question: " + question.text());
    
    Elements answerers = document.select("#answers .user-details a");
    for (Element answerer : answerers) {
        System.out.println("Answerer: " + answerer.text());
    }
    

    Do you see the difference? It’s not only less code, but Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (by e.g. developing websites and/or using jQuery).

    Summary

    The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse it, then go for the first mentioned group of parsers. There are pretty a lot of them. Which one to choose depends on the features it provides (how is HTML cleaning made easy for you? are there some listeners/interceptors and tag-specific cleaners?) and the robustness of the library (how often is it updated/maintained/fixed?). If you like to unit test the HTML, then HtmlUnit is the way to go. If you like to extract specific data from the HTML (which is more than often the real world requirement), then Jsoup is the way to go.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Ask A Question

Stats

  • Questions 453k
  • Answers 453k
  • Best Answers 0
  • User 1
  • Popular
  • Answers
  • Editorial Team

    How to approach applying for a job at a company ...

    • 7 Answers
  • Editorial Team

    What is a programmer’s life like?

    • 5 Answers
  • Editorial Team

    How to handle personal stress caused by utterly incompetent and ...

    • 5 Answers
  • Editorial Team
    Editorial Team added an answer I agree that the problem probably has little to do… May 15, 2026 at 9:36 pm
  • Editorial Team
    Editorial Team added an answer You can load the HTML with DOM's loadHTML then import… May 15, 2026 at 9:36 pm
  • Editorial Team
    Editorial Team added an answer In order to befriend a template, I think you'll need… May 15, 2026 at 9:35 pm

Trending Tags

analytics british company computer developers django employee employer english facebook french google interview javascript language life php programmer programs salary

Top Members

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.