The Archive Base Latest Questions

Editorial Team
Asked: May 13, 2026

The HTML document which I am parsing contains some ASCII control codes. I noticed that PHP’s DOMDocument parser truncates text nodes when it finds ASCII control characters within the node, such as

Device Control 3 0x13

End of Medium 0x19

File Separator 0x1C

Group Separator 0x1D

Is this a bug or a feature? Is there any way to make DOMDocument behave differently? I resorted to removing these characters before DOM processing, but I wonder whether that's the right solution.
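The workaround mentioned above, removing the control characters before DOM processing, could look roughly like this. It is a minimal sketch and not the asker's actual code: the helper name stripControlChars and the exact character ranges are assumptions, chosen to keep tab, line feed, and carriage return while dropping the other C0 control characters and DEL.

```php
<?php
// Hypothetical helper: the name and the character ranges are assumptions.
function stripControlChars(string $html): string
{
    // Remove C0 control characters except tab (0x09), line feed (0x0A)
    // and carriage return (0x0D), plus DEL (0x7F). These single bytes never
    // occur inside a UTF-8 multi-byte sequence, so a byte-wise match is safe.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $html) ?? $html;
}

$raw = "<p>before\x13middle\x1Cafter</p>";

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(stripControlChars($raw));
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The text node now contains everything around the removed characters.
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Cleaning the input before it reaches the parser keeps the result independent of how a particular libxml version reacts to such characters.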


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 9:35 am

    Probably both a bug and a feature.

    XML 1.0 is very restrictive about the ASCII control characters that it will accept: only tab (0x09), line feed (0x0A), and carriage return (0x0D) are allowed. So it seems like your DOMDocument is trying to protect you from yourself by truncating (although it should return some indication of a problem, so I'd call that a bug).

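    XML 1.1 is less restrictive; the only character that it forbids outright is NUL, although the other control characters have to be written as character references (for example &#x13;) rather than appearing literally. So, one possible solution is to configure your DOMDocument object so that it knows it should be managing 1.1.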


    Edit: it looks like you can pass the XML version number to the DOMDocument constructor (but I’m not a PHP programmer, so don’t know if I’m reading the docs correctly).


    Edit 2: I just reread your question, and realized that you're parsing, not constructing. If you prepend a valid 1.1 prologue to the input, that should be a workaround. Or perhaps, if you construct the DOMDocument with the correct version number, it will parse correctly without that prologue.
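One way to check whether the 1.1 prologue idea helps in a given environment is a small probe like the one below. It is only a sketch under stated assumptions: the function name probeXml11 and its return shape are hypothetical, it uses loadXML (so it applies only to well-formed XML input, not to HTML read with loadHTML), and whether the underlying libxml build accepts version 1.1 is not something this sketch takes for granted. It simply reports what happened.

```php
<?php
// Hypothetical probe: the function name and return shape are assumptions.
function probeXml11(string $body): array
{
    // Prepend an XML 1.1 prologue, as suggested in the answer.
    $xml = '<?xml version="1.1" encoding="UTF-8"?>' . "\n" . $body;

    // Capture parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument('1.1', 'UTF-8'); // version passed to the constructor as well
    $loaded = $doc->loadXML($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'loaded' => $loaded,
        'text'   => $loaded ? $doc->documentElement->textContent : null,
        'errors' => $errors,
    ];
}

// A control character written as a character reference.
var_dump(probeXml11('<p>before&#x13;after</p>'));
```

If 'loaded' comes back false or 'errors' is non-empty, stripping the characters before parsing remains the safer route.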
