I have two documents: Document 1 (input) Document 2 (output) Document 2 is the

Question


Asked: June 8, 2026 (2026-06-08T20:34:56+00:00)


I have two documents: Document 1 (input) and Document 2 (output).

Document 2 is the result of passing Document 1 through a transformation process which leaves any content and formatting intact (verified by side-by-side compare in Word).

However, the process removes many id numbers from the .docx files.

For example,

      <w:p w:rsidP="00B600D6" w:rsidR="00F55D78" w:rsidRDefault="00B600D6">

becomes

      <w:p>

according to a dump of each document via the following code:

Body body = ((Document)newerPackage.getMainDocumentPart().getJaxbElement()).getBody();
Node node = org.docx4j.XmlUtils.marshaltoW3CDomDocument(body).getDocumentElement();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(new DOMSource(node), 
             new StreamResult(new OutputStreamWriter(System.out, "UTF-8")));

Using the docx4j Differencer comparison method recommended here, everything (except the first line which has no formatting applied) is shown as a modification.

The question is: are the diffs a result of the missing ids, the formatting, or something else?

In case it’s important, we’re using docx4j in this context to perform automated sanity/regression tests on our round-tripping process (i.e. apply the “loss-less” process and expect no differences).


1 Answer

Editorial Team · Answer 1 · 2026-06-08T20:34:58+00:00


Disclosure: I work on docx4j

If the only difference between paragraphs is the rsid attributes, they will still be detected as different.

You could “clean” the documents before performing the comparison, so that neither docx has rsid attributes. See the Filter sample.

By the way, an easier way to see the XML for an object (e.g. a single paragraph, or the entire body) is to use XmlUtils.marshaltoString.
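To illustrate the "clean before comparing" idea, here is a minimal sketch that strips rsid attributes from a piece of WordprocessingML using only the JDK's DOM and transformer classes. It is not the docx4j Filter sample; the class name `RsidStripper` and its methods are hypothetical, and it assumes every attribute to drop lives in the main WordprocessingML namespace and has a local name starting with "rsid" (rsidR, rsidP, rsidRDefault and so on).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class RsidStripper {

    // Namespace bound to the "w" prefix in WordprocessingML.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Recursively removes w:rsid* attributes from el and its descendants.
    static void stripRsidAttributes(Element el) {
        NamedNodeMap attrs = el.getAttributes();
        // Walk backwards: the map is live and shrinks as attributes are removed.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            Attr a = (Attr) attrs.item(i);
            String local = a.getLocalName();
            if (W_NS.equals(a.getNamespaceURI())
                    && local != null && local.startsWith("rsid")) {
                el.removeAttributeNode(a);
            }
        }
        NodeList children = el.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                stripRsidAttributes((Element) child);
            }
        }
    }

    // Parses an XML string, strips the rsid attributes, and serializes it again.
    static String stripRsidsFromXml(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for getLocalName/getNamespaceURI
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripRsidAttributes(doc.getDocumentElement());

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not clean XML", e);
        }
    }

    public static void main(String[] args) {
        String p = "<w:p xmlns:w=\"" + W_NS + "\" w:rsidP=\"00B600D6\" "
                + "w:rsidR=\"00F55D78\" w:rsidRDefault=\"00B600D6\">"
                + "<w:r><w:t>Hi</w:t></w:r></w:p>";
        System.out.println(stripRsidsFromXml(p));
    }
}
```

The string passed in could be the XML that XmlUtils.marshaltoString produces for each body. Whether you then compare the cleaned strings directly or turn them back into docx4j objects for the Differencer depends on your test setup; that step is not shown here.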
