We’re doing the following: Open a Reader for a file, using some specified encoding.

Question

Editorial Team

Asked: June 12, 2026 (2026-06-12T11:18:09+00:00)

We’re doing the following:

Open a Reader for a file, using some specified encoding.
Read in each line, parsing it as CSV.

For some of the columns in the CSV data, pass the value to JSoup to strip out HTML, as below:

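// Strips HTML from one CSV cell and returns its plain text.
public String apply(@Nullable String input) {
    if (input == null) {
        return null; // Jsoup.parse does not accept null input
    }
    Document document = Jsoup.parse(input);

    return document.text();
}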
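For context, the first two steps (opening a Reader with a given encoding and reading each line as CSV) might look roughly like the sketch below. The class name `CsvReadSketch` and both helper methods are hypothetical, and the comma split is a deliberately naive stand-in for a real CSV parser, since the question does not say which one is in use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not taken from the original question.
final class CsvReadSketch {

    // Opens a Reader that decodes the file's bytes with the named charset,
    // e.g. "windows-1252" or "UTF-8".
    static BufferedReader open(Path file, String charsetName) throws IOException {
        return Files.newBufferedReader(file, Charset.forName(charsetName));
    }

    // Reads every line and splits it on commas. This is an assumption made
    // for illustration: it does not handle quoted fields or embedded commas.
    static List<String[]> readRows(Reader reader) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                rows.add(line.split(",", -1)); // -1 keeps trailing empty cells
            }
        }
        return rows;
    }
}
```

By the time a cell reaches `apply`, it is already a Java String, so the file's encoding has been applied to the raw bytes but not to any numeric character references embedded in the text.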

This works great, except in the presence of numeric character references, such as &#160;. What seems to be happening is that, since we necessarily must make the JSoup call after we’ve figured out the encoding (to get the CSV parsing to work), JSoup is working with the wrong character set by the time it gets round to converting hard-coded byte values into characters. Byte 160 (0xa0) is non-breaking space in windows-1252, but is not a valid Unicode character, so we get bad data when JSoup replaces the numeric character reference with a byte.

Is there a way around this? It would require JSoup to be given a ‘source encoding’ for numeric character references, or something like that.
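One way to check the assumption behind this question is to compare the two interpretations of a number directly: as a byte decoded with a charset, and as a Unicode code point, which is generally how HTML defines numeric character references. The sketch below uses only the JDK; the class name `NumericReferenceCheck` and its helpers are hypothetical and are not part of JSoup.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for comparing the two interpretations of a number.
final class NumericReferenceCheck {

    // Decodes a single byte value (0-255) using the named charset.
    static String decodeByte(int value, String charsetName) {
        return new String(new byte[] {(byte) value}, Charset.forName(charsetName));
    }

    // Interprets the number as a Unicode code point.
    static String fromCodePoint(int value) {
        return new String(Character.toChars(value));
    }

    public static void main(String[] args) {
        int value = 160; // the number from the reference in the question
        String viaCharset = decodeByte(value, "windows-1252");
        String viaCodePoint = fromCodePoint(value);
        // Print both results as code points so they can be compared by eye.
        System.out.printf("byte -> U+%04X, code point -> U+%04X, same: %b%n",
                viaCharset.codePointAt(0),
                viaCodePoint.codePointAt(0),
                viaCharset.equals(viaCodePoint));
    }
}
```

Running this with other values, such as those between 128 and 159, shows where the two interpretations can diverge. If they agree for the value in question, the bad data may be coming from a later step, for example the encoding used when the cleaned text is written out.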


1 Answer

Editorial Team · Answer 1 · 2026-06-12T11:18:10+00:00


Try calling the following before text():

document.outputSettings().charset("windows-1252");

For more output settings, see the Javadoc for Document.OutputSettings.
