I have an application where I would like to use an XML file to store: (1) the original text of a document, and (2) several entities that “point into” the original text using character offsets. E.g.:
<Document>
<OriginalText>This is a test</OriginalText>
<Word start_offset="0" end_offset="4" id="w1"/>
<Word start_offset="5" end_offset="7" id="w2"/>
<Word start_offset="8" end_offset="9" id="w3"/>
<Word start_offset="10" end_offset="14" id="w4"/>
</Document>
However, I’m worried about a potential problem: I have no control over the input document’s contents, so it may contain either “\n” or “\r\n” newlines. The XML specification [1] says:
The XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
I.e., newlines get normalized before the application gets to see the contents of the XML file. Unfortunately, it seems to me that this may throw off the character offsets. E.g., the character that was at offset 173 before the newlines were normalized might be at offset 168 afterwards. My questions:
- Am I interpreting the XML spec correctly?
- I assume that just encoding the newlines (i.e., replacing \r with &#xD;) will not fix the problem, because the encoded characters will be replaced before the XML processor normalizes line breaks. Is that correct?
- Can anyone recommend a good solution? One solution I’ve considered is to replace the \r characters that would otherwise get deleted during normalization with some other character (either a space, or some “special” character), but I’d prefer not to modify the original document text, if possible. Another possible solution would be to encode the original document (e.g., using base64 or uuencode), but I’d really rather not do that, as it would make the XML files more difficult to read and use.
(Using character offsets to point into the document is not a design decision that can be changed, since I need to integrate with other tools that use character offsets to point into the document text.)
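To make the concern concrete, here is a minimal sketch of the effect. It assumes Python's standard `xml.etree.ElementTree` (which is built on expat) as the XML processor; the sample text and variable names are made up for illustration, and other parsers may behave differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical input text containing Windows-style "\r\n" line breaks.
original = "line one\r\nline two\r\nend"

# Store the text literally, the way a naive writer would.
xml_doc = "<Document><OriginalText>" + original + "</OriginalText></Document>"

# The parser normalizes line breaks before the application sees the text.
parsed = ET.fromstring(xml_doc).find("OriginalText").text

# An offset computed against the original text...
offset = original.index("end")

# ...no longer points at the same characters in the parsed text.
print(repr(original[offset:offset + 3]))
print(repr(parsed[offset:offset + 3]))
print(len(original), len(parsed))
```

Each "\r\n" pair that precedes an offset shifts it by one character, so the drift grows with the number of line breaks.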
The way I understand the part of the specification you quoted, all literal (typed) CR characters get replaced, and they get replaced before parsing. Thus any CR that is represented as a character reference (&#xD;) will not get replaced with LF, since the replacement is done before parsing (or the processor must behave as if it were), and character references are converted to character data during XML parsing. Note that CRs in CDATA sections also get replaced, and that character references inside CDATA sections are not parsed into the characters they reference.
So you should be able to preserve your line breaks as they were if you serialize the CRs as character references. However, be warned: I wouldn’t count on all XML tools obeying this convention. You might also lose the CRs if the parsed XML is sent to another tool that interprets the contents again.
Also, indexing data by character position sounds quite brittle to me. Please consider whether you can find another way to tokenize or segment your data. If you need to stick with character-position-based indexing, I would suggest normalizing the text data somehow. After all, line breaks are not the only possible point of failure. Others include, for example, accented characters and ligatures.
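The character-reference approach can be sketched as follows. This is only an illustration, again assuming Python's `xml.etree.ElementTree` as the parser; `escape_text_keeping_cr` is a hypothetical helper name, not part of any library, and the XML is assembled by hand because a serializer may write a CR in element text literally.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_text_keeping_cr(text):
    """Escape text for element content, writing each CR as a character reference.

    Hypothetical helper: escape() handles &, < and > first and then applies
    the extra mapping, so the "&" in "&#xD;" is not escaped a second time.
    """
    return escape(text, {"\r": "&#xD;"})


original = "This is\r\na test & more"

xml_doc = (
    "<Document><OriginalText>"
    + escape_text_keeping_cr(original)
    + "</OriginalText></Document>"
)

# No literal CR is left in the file, so line-break normalization has
# nothing to rewrite; the reference is expanded during parsing.
parsed = ET.fromstring(xml_doc).find("OriginalText").text

start = original.index("a test")
print(parsed == original)
print(repr(parsed[start:start + 6]))
```

The caveat above still applies: if another tool re-serializes the parsed document with literal CRs, the next parse will normalize them away.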