The Archive Base


Editorial Team
Asked: May 26, 2026 (2026-05-26T23:56:09+00:00)

I’m using NekoHTML to clean up some HTML, and then feeding it to XOM

I’m using NekoHTML to clean up some HTML, and then feeding it to XOM to get an object model. Somewhere in the course of this, comments are getting escaped.

Here’s a relevant example of the input HTML (most of the <head> cut for clarity):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
    <script type="text/JavaScript">
        <!-- // Hide the JS
        startTimeout(6000000, "/");
        // -->
    </script>

Here’s the code:

import java.io.Reader;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Serializer;
import org.xml.sax.XMLReader;

// XOMSafeSAXParser is the Neko SAXParser extended to allow
// XOM to set the (unnecessary in this case) features
// external-general-entities and external-parameter-entities
XMLReader reader = new XOMSafeSAXParser();

Builder xomBuilder = new Builder(reader);
Reader input = ...; // file, resource, etc.
Document doc = xomBuilder.build(input);

Serializer s = new Serializer(System.out, "UTF-8");
s.setIndent(4);
s.setMaxLength(200);
s.write(doc);
s.flush();

Here’s the corresponding output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML lang="en">
    <HEAD>
        <SCRIPT type="text/JavaScript"> &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; </SCRIPT>
    </HEAD>

When I extract the script element from the XOM document, it looks like it’s already been mangled (the SCRIPT element has one Text node as a child, not the sequence of Texts and Comments I would expect), so I don’t think it’s the Serializer that’s going wrong.

Now, I don’t expect the line breaks to be preserved, and in fact I’m going to throw the script tags out anyway. But there are other places where I’d like comments to be preserved or, at minimum, where I’d like to be able to get the text without escaped comments embedded in it.

Any ideas?


Update: NekoHTML was mangling some tags, so I switched to JTidy, and I have the same problem. Interestingly, though, it’s only a problem for the script tag in the header; other comments come through fine. And there are weird extra JavaScript comments that I suspect (hope and pray) are JTidy’s fault.

    <script type="text/JavaScript"> // &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; // </script>

It looks as though what JTidy’s doing is converting <script> contents to CDATA; when I send JTidy’s raw output to stdout, I get this:

<script type="text/JavaScript">
//<![CDATA[
        <!-- // Hide the JS
        startTimeout(6000000, "/");
        // -->
    //]]>
</script>
1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 11:56 pm (2026-05-26T23:56:10+00:00)

    All right. I seem to have found the explanation at least for the JTidy case:

    the basic issue is that browser scripts will often contain special XML
    characters: '&', '<', ']]>' and '<' + '/' + Letter. If these are escaped to make XML processors happy, it will break the
    script. The agreed solution is to place source within a CDATA
    section. This is now done for both <script> and <style> tags. So far,
    so good. But there are a number open issues and possible unintended
    consequences. … script source is often embedded in HTML
    comments to prevent parsing by older browsers that do not support
    Javascript.

    HTML comments in general are okay; it’s just HTML comments inside <script> tags that get mangled, because they’re turned into (and escaped within) CDATA. XOM, in turn, merges CDATA into Text.
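
The effect described above can be reproduced without JTidy or XOM. The following is a minimal sketch that uses the JDK's own DOM parser as a stand-in, with coalescing switched on to approximate the way XOM folds CDATA into Text. The class and method names (`CdataMergeDemo`, `commentCount`, `rootText`) are made up for this illustration, and the sample markup is a shortened version of the script block from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Stand-in demo built on the JDK DOM parser; it is NOT XOM or JTidy code.
final class CdataMergeDemo {

    // Parses the XML string and returns its root element.
    private static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Coalescing converts CDATA sections into ordinary text nodes,
            // which approximates XOM merging CDATA into Text.
            factory.setCoalescing(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Counts the Comment nodes that are direct children of the root element.
    static int commentCount(String xml) {
        NodeList children = root(xml).getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.COMMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    // Returns the text under the root element (comment nodes are excluded).
    static String rootText(String xml) {
        return root(xml).getTextContent();
    }

    public static void main(String[] args) {
        // A real comment inside the element...
        String bare = "<script><!-- hide\nrun();\n// --></script>";
        // ...versus the same characters wrapped in CDATA, as JTidy emits them.
        String wrapped =
                "<script>//<![CDATA[\n<!-- hide\nrun();\n// -->\n//]]></script>";

        System.out.println("comments (bare):    " + commentCount(bare));
        System.out.println("comments (wrapped): " + commentCount(wrapped));
        System.out.println("text (wrapped):     " + rootText(wrapped));
    }
}
```

Once the comment delimiters sit inside a CDATA section they are plain characters, so the parser never reports a Comment node for them. A serializer then has to write the `<` and `>` as `&lt;` and `&gt;`, which matches the output shown in the question.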

    Technically, I think this means JTidy is broken, but it’s good enough for my purposes since I don’t need the <script> tags at all.

    Still, if anybody has a solution that gets me out what I put in, I’d still like to hear it.
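
Short of that, the fallback mentioned in the question (getting text without comment markup embedded in it) can be approximated by post-processing the text value. This is only a sketch of one possible workaround, not a fix for JTidy: the class name `CommentMarkupStripper` and its `strip` method are hypothetical, and it assumes it is given the unescaped text value of a node (where the delimiters appear literally as `<!--` and `-->`), not the serialized output.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes literal "<!-- ... -->" spans from a text value.
final class CommentMarkupStripper {

    // Matches "<!--" up to the nearest "-->", including across line breaks.
    private static final Pattern HTML_COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Returns the text with every complete comment span removed.
    // An opener without a matching "-->" is left untouched.
    static String strip(String text) {
        return HTML_COMMENT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("before <!-- hidden --> after"));
    }
}
```

For a script body that is entirely wrapped in a hiding comment, as in the question, this removes the script source along with the delimiters and leaves only the `//` markers JTidy added. That is acceptable when the script is going to be thrown away, but it does not recover the original input.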

© 2021 The Archive Base. All Rights Reserved