Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6756595
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T13:32:36+00:00 2026-05-26T13:32:36+00:00

Currently, I am trying to clean up an HTML file using JTidy, convert it

  • 0

Currently, I am trying to clean up an HTML file using JTidy, convert it to XHTML and provide the results to a DOM parser. The following code is the result of these efforts:

public class HeaderBasedNewsProvider implements INewsProvider {

    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
            Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}

However, the DOM parser implementation (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) on my system seems to be incredibly slow. Even for one-line documents such as the following, parsing takes 2-3 minutes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>

Note that – in contrast to the DOM parser – JTidy finishes its work within a second. Therefore, I suspect that I’m somehow misusing the DOM API.

Thanks in advance for any suggestions on this one!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T13:32:36+00:00Added an answer on May 26, 2026 at 1:32 pm

    Even when not validating, a XML parser needs to fetch the DTD, for example to support named character entities. You should look into implementing an EntityResolver that resolves the request for the DTD to a local copy.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am currently trying to learn about OpenGL ES for the iPhone and following
I am trying to parse some HTML snippets and want to clean them up
I'm trying to clean some HTML with libtidy (C language), the problem is: I
I'm currently using innerHTML to retrieve the contents of an HTML element and I've
I am currently trying to create some nested lists to display the following... A...R...X
I am trying to clean up some HTML values that are imported from other
Currently trying at add ajax to a site, after much reading I discovered that
Im currently trying to downlaod a audio track from a WCF, i need some
Im currently trying to learn some stuff about encryption, it's algorithms and how it
I'm currently trying to add an MSChart to a partial view in ASP.NET MVC

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.