I have thousands of HTML files to process using Groovy/Java and I need to

Question

Asked: May 27, 2026

I have thousands of HTML files to process using Groovy/Java, and I need to produce XML at the end. Some of the files contain the character escape sequence ’. When I produce the output XML, the subsequent parse of that XML complains about an illegal Unicode character in the file. The sequence I am going through is: HTML file -> HtmlCleaner -> SimpleXmlSerializer -> XmlSlurper -> CLOB (in HSQLDB) -> ClobInputStream -> FileWriter.

How do I get the correct character code in the output so that the parser doesn’t complain?

Note: This question has been heavily modified to correctly represent what the real problem was. The comments below refer to the original version.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T01:36:06+00:00

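The cause is that java.io.FileWriter, when constructed without a charset, writes with the platform default encoding, which is not necessarily UTF-8 (the default only became UTF-8 in JDK 18). Characters such as ’ are then written as bytes that a parser expecting UTF-8 rejects. Create the writer with an explicit encoding instead: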

def writer = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8")

Hat tip to http://www.malcolmhardie.com/weblogs/angus/2004/10/23/java-filewriter-xml-and-utf-8/ for the answer.
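
For anyone who wants to check the fix in isolation, here is a minimal Java sketch of the same idea. The class name Utf8WriterSketch, the helper writeXml, and the sample XML string are made up for illustration and are not part of the original pipeline. It writes a string containing ’ (U+2019) through an OutputStreamWriter with an explicit UTF-8 charset and then prints the bytes that ended up in the file.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class name; illustrates writing XML text with an explicit charset.
class Utf8WriterSketch {

    // Hypothetical helper: writes the given text to outputFile as UTF-8,
    // regardless of the platform default encoding.
    static void writeXml(Path outputFile, String xml) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(outputFile.toFile()), StandardCharsets.UTF_8)) {
            writer.write(xml);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("utf8-sketch", ".xml");
        // Sample document (an assumption); \u2019 is the right single quotation mark.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>doesn\u2019t</p>";
        writeXml(out, xml);

        // Dump the bytes in hex so the encoding of U+2019 can be inspected.
        StringBuilder hex = new StringBuilder();
        for (byte b : Files.readAllBytes(out)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
        Files.delete(out);
    }
}
```

In UTF-8, U+2019 is encoded as the three bytes E2 80 99, so that sequence should appear in the dump where the apostrophe sits. On Java 11 and later, new FileWriter(file, StandardCharsets.UTF_8) or Files.newBufferedWriter(path, StandardCharsets.UTF_8) should work equally well. Whichever writer is used, the encoding named in the XML declaration has to match the encoding actually written.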
