Sorry to double post. But my earlier post was based on Flex:
Flex TextArea – copy/paste from Word – Invalid unicode characters on xml parsing
But now I’m posting this on the Java side.
The issue is:
We have an email functionality (part of our application) where we create an XML string & put it on the queue. Another application picks it up, parses the XML & sends out emails.
We get an XML parser exception when the email text (<BODY>....</BODY) is copy/pasted from Word:
Invalid character in attribute value BODY (Unicode: 0x1A)
As we use Java as well, I’m trying to remove the invalid characters from the String using:
body = body.replaceAll("‘", "");
body = body.replaceAll("’", "");
//Strip invalid characters
public String stripNonValidXMLCharacters(String in) {
StringBuffer out = new StringBuffer(); // Used to hold the output.
char current; // Used to reference the current character.
if (in == null || ("".equals(in))) {
return ""; // vacancy test.
}
for (int i = 0; i < in.length(); i++) {
//NOTE: No IndexOutOfBoundsException caught here; it should not happen.
current = in.charAt(i);
if ((current == 0x9)
|| (current == 0xA)
|| (current == 0xD)
|| ((current >= 0x20) && (current <= 0xD7FF))
|| ((current >= 0xE000) && (current <= 0xFFFD))
|| ((current >= 0x10000) && (current <= 0x10FFFF)))
out.append(current);
}
return out.toString();
}
//Strip once more
private String stripNonValidXMLCharacter(String in) {
if (in == null || ("".equals(in))) {
return null;
}
StringBuffer out = new StringBuffer(in);
for (int i = 0; i < out.length(); i++) {
if (out.charAt(i) == 0x1a) {
out.setCharAt(i, '-');
}
}
return out.toString();
}
//Replace the special characters if any
emailText = emailText.replaceAll("[\\u0000-\\u0008\\u000B\\u000C"
+ "\\u000E-\\u001F"
+ "\\uD800-\\uDFFF\\uFFFE\\uFFFF\\u00C5\\u00D4\\u00EC"
+ "\\u00A8\\u00F4\\u00B4\\u00CC\\u2211]", " ");
emailText = emailText.replaceAll("[\\x00-\\x1F]", "");
emailText = emailText.replaceAll(
"[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");
emailText = emailText.replaceAll("\\p{C}", "");
But they still do not work. Also the XML string starts with:
<?xml version="1.0" encoding="UTF-8"?>
<EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\\SMTPSchema.xsd\">
I think the issue occurs when there are multiple Tabs in the Word doc. Like for eg.
Text......text
<newLine>
<tab><tab><tab> text...text
<newLine>
The resulting xml string is:
<?xml version="1.0" encoding="UTF-8"?> <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd"> <EMAIL SOURCE="t@t.com" DEST="t@t.com" CC="" BCC="t@t.com" SUBJECT="test 61" BODY="As such there was no mechanism constructed to migrate the enrollment user base to Data Collection or to keep security attributes for common users in sync between the two systems. The purpose of this document is to outline two strategies for bring the user base between the two applications into sync.? It still is the same. ** Please note: This e-mail message was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message."/> </EMAILS>
Please note then the “?” is where there are multiple tabs in the Word doc. Hope my question is clear & someone can help in resolving the issue
Thanks
The invalid (hidden) character was from the UI (Flex TextArea). So had to take care of that in the UI so that it does not pass over to Java as well. Handled & removed it using the chagingHandler in the Flex textArea to restrict the characters.