This isn’t a repeat of a previous question; I have found that the issue is with the Process.
I have a problem with my program whereby special characters are seemingly lost in the InputStream of a Java Process.
The code I am using is as follows:
String command = "/usr/local/bin/getTitle <URL>";
Process shellCommand = Runtime.getRuntime().exec(command);
BufferedReader stdInput = new BufferedReader(new InputStreamReader(shellCommand.getInputStream(), "UTF-8"));
String output = null;
while ((output = stdInput.readLine()) != null) {
    System.out.println(output);
}
If I run the ‘command’ from the command line, I get the following output:
PSY_-_GANGNAM_STYLE_(강남스타일)_M_V
However, the output of System.out.println(output); is as follows:
PSY_-_GANGNAM_STYLE_()_M_V
And this completely breaks my program.
I’m completely stumped, and I haven’t found anything even remotely related to this in my search. Any help greatly appreciated! Thanks in advance.
UPDATE:
If I change command as follows:
String command="echo 'PSY_-_GANGNAM_STYLE_(강남스타일)_M_V'";
Then when printing the output the special characters are displayed correctly. Does this help in understanding where the problem lies?
It seems pretty clear that this problem is caused by mismatched character encodings somewhere. The two places it could be are the Reader stack that reads from the external process, or the PrintStream stack for System.out. (The latter seems unlikely.)
Here’s what I’d do:
- Run the locale command from the command line to see what character encoding is used by your command shell.
- Check that the encoding is the same as the Java default character encoding.
- Check that they are both the same as the encoding you are using to read from the external process. (You have hard-wired that to “UTF-8” …)
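The Java side of these checks can be done from inside the JVM. Here is a minimal sketch; the class name EncodingCheck and its methods are made up for illustration and are not part of the original program. It prints the JVM's default charset and the locale-related environment variables the JVM was started with, which a child process started via Runtime.exec(command) inherits by default:

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class EncodingCheck {

    // Name of the charset Java uses when no charset is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    // Locale-related environment variables as seen by the JVM.
    // A variable that is not set maps to null.
    static Map<String, String> localeEnvironment() {
        Map<String, String> env = new LinkedHashMap<>();
        for (String name : new String[] {"LANG", "LC_ALL", "LC_CTYPE"}) {
            env.put(name, System.getenv(name));
        }
        return env;
    }

    public static void main(String[] args) {
        System.out.println("Java default charset: " + defaultCharsetName());
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        for (Map.Entry<String, String> e : localeEnvironment().entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Compare what this prints with the output of locale in the shell where the command works. If the JVM was launched from somewhere other than that shell (an IDE or a service, for example), the variables may differ or be unset, and the external program may then fall back to a non-UTF-8 encoding.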
If that doesn’t reveal the source of the problem, try replacing the command string with "locale" to see what locale settings get propagated to the external process.
And if that doesn’t work, try capturing the output from the external command as bytes, displaying them in hexadecimal, and trying to hand-decode them as UTF-8 and as other possible character sets.
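The byte-level check could look roughly like the following. This is a sketch under assumptions: the class ProcessBytes and its helpers are hypothetical names, the command to run is taken from the program arguments, and the list of charsets to try (UTF-8, ISO-8859-1, EUC-KR) is only a guess at plausible candidates.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class ProcessBytes {

    // Reads the stream to the end without decoding anything.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated two-digit lowercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java ProcessBytes <command> [args...]");
            return;
        }
        // Run the command given on the command line; let its stderr pass through.
        Process process = new ProcessBuilder(args)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        byte[] raw = readAll(process.getInputStream());
        process.waitFor();

        System.out.println("hex: " + toHex(raw));
        // Decode the same bytes with each candidate charset for comparison.
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "EUC-KR"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + new String(raw, Charset.forName(name)));
            }
        }
    }
}
```

In UTF-8, 강 is the three bytes ea b0 95. If the hex dump shows 28 (the opening parenthesis) followed directly by 29 (the closing one), the Korean characters were already missing from the external command's output before Java read it, and the Reader is not at fault.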