I need to create an XML file with DOM in Java (under Eclipse), using the following code:
// write the content into xml file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(new File("output.xml"));
transformer.transform(source, result);
My XML’s first line is :
<?xml version="1.0" encoding="UTF-8"?>
and not :
<?xml version="1.0"?>
My questions are :
- What is the difference between those two declarations?
- How can I generate the XML file using the header:
<?xml version="1.0"?>
Regards
In the modern world, text files have an “encoding”, which defines how characters are represented as bytes in the file. The encoding makes no visible difference if your file contains ONLY plain ASCII characters (0x00 through 0x7F), because ISO-8859-x and UTF-8 all represent those identically, but if you need to represent anything else, such as symbols or accented characters, then a consumer of the file needs to know how those characters are encoded.
There are several different ways to encode extended characters, the most common ones being the ISO-8859-x family (where x depends on the language group) and the encodings of Unicode, which assigns a unique number (a “code point”) to every character. The ISO code pages are single-byte encodings that use byte values above 0x7F for extended characters. UTF-8 represents each Unicode code point as a variable-length sequence of one to four 8-bit bytes. The same extended character will have different representations in different encodings: e-circumflex (U+00EA), for example, is the single byte 0xEA in ISO-8859-1 but the two bytes 0xC3 0xAA in UTF-8.
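To make the difference concrete, here is a small sketch that prints the bytes Java produces for e-circumflex in two encodings. The class and method names (EncodingBytes, hex) are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Hypothetical helper: the bytes of `text` in `charset`, as space-separated hex.
    public static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eCircumflex = "\u00EA"; // e-circumflex, code point U+00EA
        // One byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println("ISO-8859-1: " + hex(eCircumflex, StandardCharsets.ISO_8859_1)); // EA
        System.out.println("UTF-8:      " + hex(eCircumflex, StandardCharsets.UTF_8));      // C3 AA
    }
}
```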
The serializer you used is configured to output UTF-8, and the encoding="UTF-8" in the declaration states that explicitly. A consumer of that file must be aware that UTF-8 encoding was used, or risk mangling the data. You have probably seen web pages containing black-diamond characters, or text where things like apostrophes or other special characters are replaced with two or three weird characters. These are symptoms of text being decoded with a different encoding than the one used to write it.
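Both symptoms can be reproduced in a few lines. This is only an illustrative sketch; the class and method names (Mojibake, recode) are invented for the example, and what you actually see printed depends on your console's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    // Hypothetical helper: encode text with one charset, decode the bytes with another.
    public static String recode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "cafe" with an e-acute

        // UTF-8 bytes read as ISO-8859-1: the e-acute becomes two characters
        // (U+00C3 followed by U+00A9).
        System.out.println(recode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));

        // ISO-8859-1 bytes read as UTF-8: the lone 0xE9 byte is not valid UTF-8,
        // so the decoder substitutes U+FFFD, which is often drawn as a black diamond.
        System.out.println(recode(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```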
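For XML specifically, the two declarations mean the same thing as long as the file really is UTF-8: the XML specification requires a parser to assume UTF-8 (or UTF-16, if a byte order mark is present) when there is no encoding declaration. There is probably a way to force the serializer to omit the encoding (the standard OutputKeys.OMIT_XML_DECLARATION output property drops the declaration entirely), but if you do, any consumer that is not a conforming XML parser may not be able to decode the file correctly, since it will have to guess about the encoding.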
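One possible way to get the shorter header, sketched here rather than offered as the definitive approach: tell the Transformer to omit its own declaration and write the header yourself before serializing. OutputKeys.OMIT_XML_DECLARATION and OutputKeys.ENCODING are standard JAXP properties; the class name PlainDeclarationWriter and its write method are made up for this example, and it assumes the output should stay UTF-8.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class PlainDeclarationWriter {
    // Hypothetical helper: writes a hand-made declaration without the encoding
    // attribute, then lets the Transformer serialize the document body only.
    public static void write(Document doc, OutputStream out)
            throws IOException, TransformerException {
        out.write("<?xml version=\"1.0\"?>\n".getBytes(StandardCharsets.UTF_8));

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Suppress the serializer's own <?xml ... ?> line.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Keep the bytes UTF-8, which is what a parser assumes when no encoding is declared.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        // Minimal stand-in for the document built in the question.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("root"));

        try (OutputStream out = new FileOutputStream("output.xml")) {
            write(doc, out);
        }
    }
}
```

The header and the body are written to the same byte stream, so the file stays consistent only if both use the same encoding; switching the Transformer to anything other than UTF-8 or UTF-16 without declaring it would produce a file that parsers misread.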