We’re doing the following:
- Open a Reader for a file, using some specified encoding.
- Read in each line, parsing it as CSV.
- For some of the columns in the CSV data, pass the value to JSoup to clean out the HTML, as below:

      public String apply(@Nullable String input) {
          Document document = Jsoup.parse(input);
          return document.text();
      }
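For readers who want to reproduce the setup, the three steps can be sketched with the JDK alone. This is only an illustration under stated assumptions: `CsvCleanSketch`, `cleanColumn`, and `htmlColumn` are hypothetical names, the comma split is a naive stand-in for a real CSV parser (no quoting or escaping), and `cleaner` stands in for the JSoup-based `apply` method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class CsvCleanSketch {
    // Reads every line from `source`, splits it on commas (naive stand-in for a
    // real CSV parser), and applies `cleaner` to the field at index `htmlColumn`.
    static List<String[]> cleanColumn(Reader source, int htmlColumn,
                                      UnaryOperator<String> cleaner) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The -1 limit keeps trailing empty fields.
                String[] fields = line.split(",", -1);
                if (htmlColumn < fields.length) {
                    fields[htmlColumn] = cleaner.apply(fields[htmlColumn]);
                }
                rows.add(fields);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

In the real pipeline the `Reader` would be opened with the specified encoding, for example via `Files.newBufferedReader(path, charset)`, and the cleaner would delegate to the `apply` method shown in the question.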
This works great, except in the presence of numeric character references such as `&#160;`. What seems to be happening is that, because we must make the JSoup call after we’ve figured out the encoding (to get the CSV parsing to work), JSoup is working with the wrong character set by the time it converts hard-coded byte values into characters. Byte 160 (0xa0) is a non-breaking space in windows-1252, but is not a valid Unicode character, so we get bad data when JSoup replaces the numeric character reference with a byte.
Is there a way around this? It would require JSoup to be given a ‘source encoding’ for numeric character references, or something like that.
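Before looking for a setting, it may be worth checking the premise about byte 160 with the JDK alone. The sketch below is not JSoup code. `NbspCheck` and its methods are hypothetical helpers, and it assumes the `windows-1252` charset is available in the running JVM. It decodes byte 0xA0 as windows-1252 and, separately, resolves `&#160;` by treating the number as a Unicode code point, which is how numeric character references are generally defined.

```java
import java.nio.charset.Charset;

final class NbspCheck {
    // Decodes one byte with the named charset and returns the resulting code point.
    static int decodeByte(int unsignedByte, String charsetName) {
        byte[] bytes = {(byte) unsignedByte};
        return new String(bytes, Charset.forName(charsetName)).codePointAt(0);
    }

    // Resolves a decimal reference such as "&#160;" by treating the number
    // as a Unicode code point (no document encoding is involved).
    static String resolveDecimalReference(String reference) {
        int codePoint = Integer.parseInt(reference.substring(2, reference.length() - 1));
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int fromByte = decodeByte(0xA0, "windows-1252");
        String fromReference = resolveDecimalReference("&#160;");
        System.out.println(Integer.toHexString(fromByte));            // a0
        System.out.println(Character.isDefined(fromByte));            // true
        System.out.println(fromReference.codePointAt(0) == fromByte); // true
    }
}
```

Both routes arrive at U+00A0, which is a defined Unicode character (NO-BREAK SPACE). If that matches what you see, the bad data may come from a later step, such as how the cleaned text is written out or compared, rather than from the reference being resolved in the wrong character set.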
Try calling the following before `text()`:

For more output settings, see the javadoc.