I have thousands of HTML files to process using Groovy/Java, and I need to produce XML at the end. Some of the files contain the character escape sequence ’. When I produce the output XML, the subsequent parse of that XML complains about an illegal Unicode character in the file. The sequence I am going through is: HTML file -> HtmlCleaner -> SimpleXmlSerializer -> XmlSlurper -> CLOB (in HSQLDB) -> ClobInputStream -> FileWriter.
How do I get the correct character code in the output so that the parser doesn’t complain?
Note: This question has been heavily modified to correctly represent what the real problem was. The comments below refer to the original version.
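The answer is that java.io.FileWriter does not write UTF-8 by default. It uses the platform's default character encoding, which is often not UTF-8 (the default only became UTF-8 in Java 18), so characters such as ’ are written as bytes that an XML parser expecting UTF-8 rejects. Instead, create the writer with an explicit encoding: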
    def writer = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8")

Hat tip to http://www.malcolmhardie.com/weblogs/angus/2004/10/23/java-filewriter-xml-and-utf-8/ for the answer.
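
For anyone doing this from plain Java rather than Groovy, here is a minimal sketch of the same fix. The class name `Utf8WriterSketch`, the method `writeUtf8`, and the sample string are made up for illustration and are not part of the original pipeline. It assumes Java 7 or later for `StandardCharsets` and try-with-resources.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, for illustration only.
class Utf8WriterSketch {

    // Writes the text as UTF-8, independent of the platform default encoding.
    static void writeUtf8(File outputFile, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(outputFile), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("utf8-sketch", ".xml");
        out.deleteOnExit();

        // \u2019 is the right single quotation mark.
        writeUtf8(out, "<p>doesn\u2019t</p>");

        // Read the file back with an explicit charset as well.
        String roundTrip = new String(
                Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals("<p>doesn\u2019t</p>"));
    }
}
```

The same rule applies on the reading side: whatever reads the file back should also name UTF-8 explicitly rather than rely on the platform default.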