The Archive Base

Editorial Team
Asked: May 22, 2026 (2026-05-22T20:43:27+00:00)

I have an application where I would like to use an XML file to

I have an application where I would like to use an XML file to store: (1) the original text of a document, and (2) several entities that “point into” the original text using character offsets. E.g.:

<Document>
  <OriginalText>This is a test</OriginalText>
  <Word start_offset="0" end_offset="4" id="w1"/>
  <Word start_offset="5" end_offset="7" id="w2"/>
  <Word start_offset="8" end_offset="9" id="w3"/>
  <Word start_offset="10" end_offset="14" id="w4"/>
</Document>

However, I’m worried about a potential problem: I have no control over the input document’s contents, so it may contain either “\n” or “\r\n” newlines. The XML specification [1] says:

The XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

I.e., newlines get normalized before the application gets to see the XML file. Unfortunately, it seems to me that this may throw off the character offsets. E.g., the character that was at offset 173 before line breaks were normalized might be at offset 168 afterwards. My questions:

  1. Am I interpreting the XML spec correctly?

  2. I assume that just encoding the newlines (i.e., replacing \r with &#xD;) will not fix the problem, because the encoded characters will be replaced before the XML processor normalizes line breaks. Is that correct?

  3. Can anyone recommend a good solution? One solution I’ve considered is to replace the \r characters that would otherwise get deleted during normalization with some other character (either a space or some “special” character), but I’d prefer not to modify the original document text if possible. Another possible solution would be to encode the original document (e.g., using base64 or uuencode), but I’d really rather not do that, as it would make the XML files more difficult to read and use.

(Using character offsets to point into the document is not a design decision that can be changed, since I need to integrate with other tools that use character offsets to point into the document text.)

[1] http://www.w3.org/TR/REC-xml/#sec-line-ends


1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 8:43 pm

    As I understand the part of the specification you quoted, every literal (typed) CR character in the file is replaced, and the replacement happens before parsing (or the processor must behave as if it did). A CR written as the character reference &#xD; is therefore not replaced with LF, because character references are converted to character data only during parsing, after line-break normalization. Note that literal CRs in CDATA sections are normalized too, but character references inside CDATA sections are not expanded to the characters they reference, so &#xD; stays as literal text there.
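
A minimal sketch of this behaviour, assuming Python's standard `xml.etree.ElementTree` module (which uses the expat parser). The element name comes from the question; the sample strings are made up for illustration, and other parsers should be checked separately:

```python
import xml.etree.ElementTree as ET

# Literal CR LF and a lone literal CR inside the element content.
literal = b"<OriginalText>one\r\ntwo\rthree</OriginalText>"

# The same text, but every CR is written as the character reference &#xD;.
escaped = b"<OriginalText>one&#xD;\ntwo&#xD;three</OriginalText>"

literal_text = ET.fromstring(literal).text
escaped_text = ET.fromstring(escaped).text

# Literal line breaks are normalized to LF before the application sees them.
print(repr(literal_text))   # 'one\ntwo\nthree'

# CRs written as character references survive parsing unchanged.
print(repr(escaped_text))   # 'one\r\ntwo\rthree'

# The normalized text is one character shorter, so later offsets shift.
print(len(literal_text), len(escaped_text))   # 13 14
```

In the first case the offset of every character after the CR LF pair moves down by one, which is exactly the shift the question is worried about.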

    So you should be able to preserve your CR characters, and with them the original offsets, if you serialize them as character references. However, be warned: I wouldn’t count on all XML tools obeying this rule. You might also lose the CRs if the parsed XML is sent on to another tool that interprets the contents again.
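
One way the writing side could look, again as a sketch in Python. The helper name `serialize_text` is hypothetical, and the document is assembled by plain string concatenation only to keep the example short; `xml.sax.saxutils.escape` handles `&`, `<` and `>`, and the extra mapping turns each CR into `&#xD;`:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize_text(text: str) -> str:
    """Escape markup characters and write every CR as a character reference."""
    return escape(text, {"\r": "&#xD;"})


original = "This is\r\na test & more"

xml_doc = (
    "<Document><OriginalText>"
    + serialize_text(original)
    + "</OriginalText></Document>"
)

# Round trip: parse the document again and compare with the original text.
parsed = ET.fromstring(xml_doc).find("OriginalText").text
print(parsed == original)

# An offset computed on the original text still points at the same word.
start, end = 9, 10          # the word "a" in the original text
print(repr(parsed[start:end]))
```

Whether a given XML library writes CRs this way on its own varies, so it is worth testing the round trip with the exact serializer and parser you intend to use.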

    Also, indexing data by character position sounds quite brittle to me. Please consider whether you can find another way to tokenize or segment your data. If you need to stick with position-based indexing, I would suggest normalizing the text in some well-defined way before computing offsets. After all, line breaks are not the only possible point of failure; others include, for example, accented characters and ligatures.
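
If you go the normalization route but still have to exchange offsets with tools that index the un-normalized text, one option is to keep a map between the two. The following is only a sketch: the function name `normalize_with_offset_map` is hypothetical, and it covers line breaks only, not accented characters or ligatures:

```python
def normalize_with_offset_map(text: str):
    """Normalize CR LF and lone CR to LF.

    Returns (normalized_text, offset_map), where offset_map[i] is the
    position in normalized_text that corresponds to original offset i.
    The map has len(text) + 1 entries so exclusive end offsets work too.
    """
    out = []
    offset_map = []
    i = 0
    n = len(text)
    while i < n:
        offset_map.append(len(out))
        ch = text[i]
        if ch == "\r":
            if i + 1 < n and text[i + 1] == "\n":
                # CR LF collapses to a single LF; both offsets map to it.
                offset_map.append(len(out))
                i += 1
            out.append("\n")
        else:
            out.append(ch)
        i += 1
    offset_map.append(len(out))  # entry for the end-of-text offset
    return "".join(out), offset_map


original = "ab\r\ncd\ref"
normalized, offset_map = normalize_with_offset_map(original)

# The span [4, 6) in the original text is "cd"; translate it and check.
start, end = offset_map[4], offset_map[6]
print(repr(normalized), repr(normalized[start:end]))
```

Offsets from the other tools are translated through `offset_map` before being stored, and the normalized text can then go into the XML file without any special escaping of line breaks.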
