I am using HtmlCleaner library in order to parse/convert HTML files in java. It

Question

Editorial Team

Asked: June 2, 2026 (2026-06-02T20:34:02+00:00)

I am using the HtmlCleaner library to parse/convert HTML files in Java.

It seems that it is not able to handle Spanish characters such as ‘ÁáÉéÍíÑñÓóÚúÜü’.

Is there any property which I can set in HtmlCleaner for handling this or any other solution? Here’s the code I’m using to invoke it:

CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
java.io.File file = new java.io.File("C:\\example.html");
TagNode tagNode = new HtmlCleaner(props).clean(file);


1 Answer

Editorial Team · Answer 1 · 2026-06-02T20:34:03+00:00

Unless you specify a character set, HtmlCleaner uses the JVM’s default one. On Windows this will typically be Cp1252 (windows-1252), not UTF-8, which is probably where it’s going wrong.
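To illustrate the kind of corruption a charset mismatch produces, here is a minimal sketch that uses only the JDK and does not involve HtmlCleaner at all. It assumes the HTML file is stored as UTF-8; the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when UTF-8 bytes are decoded
// with a different charset, as a reader using the platform default might do.
class CharsetMismatchDemo {

    /** Encodes the text as UTF-8, then decodes those bytes with the given charset. */
    public static String decodeUtf8BytesAs(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u00F1"; // the single character ñ

        // ñ is two bytes in UTF-8 (0xC3 0xB1); windows-1252 maps each byte
        // to its own character, giving U+00C3 U+00B1 ("Ã±").
        String garbled = decodeUtf8BytesAs(original, Charset.forName("windows-1252"));
        System.out.println("decoded as windows-1252: " + garbled.length() + " chars");

        // Decoding with the charset the bytes were written in restores the text.
        String intact = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);
        System.out.println("decoded as UTF-8 equals original: " + intact.equals(original));
    }
}
```

If the parsed output shows pairs like these in place of accented letters, a UTF-8 file being read with a single-byte default charset is a likely cause.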

You can either:

- specify -Dfile.encoding=UTF-8 on the JVM command line
- use the HtmlCleaner.clean() overload that accepts a character set name
```
TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
```
(this overload takes the name as a String, so if you’d rather not type the literal, StandardCharsets.UTF_8.name() on Java 7+ or Google Guava’s Charsets.UTF_8.name() yields the same value)
- use the HtmlCleaner.clean() overload that accepts a Reader, passing it an InputStreamReader which you’ve already constructed with the correct character set.
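The last option can be sketched as follows. This is only an outline using JDK classes, assuming the file is UTF-8; the class and helper names are hypothetical, and the HtmlCleaner call is left as a comment so the snippet runs without the library on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class: opens a file with an explicit charset so the
// result does not depend on the JVM default.
class Utf8ReaderSketch {

    /** Opens the file as a character stream decoded as UTF-8. */
    public static Reader openUtf8(Path path) throws IOException {
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8));
    }

    /** Drains the reader into a String (a stand-in for handing it to a parser). */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, read);
        }
        return sb.toString();
    }

    /** Writes the text to a temporary UTF-8 file and reads it back through openUtf8. */
    public static String roundTrip(String html) {
        try {
            Path tmp = Files.createTempFile("example", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                try (Reader reader = openUtf8(tmp)) {
                    // With HtmlCleaner available, the reader would be passed to the
                    // library at this point instead, along the lines of:
                    //   TagNode tagNode = new HtmlCleaner(props).clean(reader);
                    return readAll(reader);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>\u00C1\u00E1\u00D1\u00F1\u00DC\u00FC</p>"; // ÁáÑñÜü
        System.out.println("text survived: " + roundTrip(html).equals(html));
    }
}
```

Constructing the reader yourself keeps the charset decision next to the file-opening code, which may be easier to audit than a JVM-wide flag.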
