This question concerns a Tomcat 7 web application, which is connected to a MySQL (5.5.16) database.
When I open a zip file whose filenames are encoded in the windows-1252 charset, the characters seem to be interpreted correctly by Java:
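import java.nio.charset.Charset;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// zipFile is the java.io.File (or path) of the archive being read
ZipFile zf = new ZipFile( zipFile, Charset.forName( "windows-1252" ) );
Enumeration<? extends ZipEntry> entries = zf.entries();
while( entries.hasMoreElements() ) {
    ZipEntry ze = entries.nextElement();
    if( ! ze.isDirectory() ) {
        String name = ze.getName();
        System.out.println( name ); //prints correct filenames, e.g. café.pdf
    }
}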
Omitting the Charset argument from the ZipFile constructor (so that the default, UTF-8, is used to decode the entry names) causes an exception.
The filenames in the zip file are printed correctly to standard output, including diacritics.
But when I subsequently store the filename in the database, the é is replaced with a question mark (as seen in the mysql console client).
I had no problems inserting special characters from the web application into MySQL before.
When I execute an INSERT with é in Java source code:
statement.executeUpdate( "insert into files (filename) values ('café.pdf')" );
the é is stored correctly in MySQL.
Also, my log file shows what looks like a comma instead of é: caf‚.pdf
Does anyone know what could be happening here?
The issue is resolved. Another post suggested that the encoding of filenames in a zip file might not be windows-1252 but rather IBM437. Changing the Charset passed to the ZipFile constructor from Charset.forName( "windows-1252" ) to Charset.forName( "IBM437" ) gave the desired result: when saving the acquired filename in MySQL, it was stored correctly with é.
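Why the two charsets disagree can be sketched with a single byte. The example below is only an illustration, under the assumption that the é in the archive is stored as the byte 0x82 (which is what the Unicode values reported in this post imply) and that the JVM ships the IBM437 charset. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;

class ZipNameDecodingDemo {
    // Decode one raw byte with the named charset and report the resulting
    // character in "U+XXXX" notation.
    static String decodeByte(int rawByte, String charsetName) {
        byte[] bytes = { (byte) rawByte };
        String decoded = new String(bytes, Charset.forName(charsetName));
        return String.format("U+%04X", (int) decoded.charAt(0));
    }

    public static void main(String[] args) {
        // Assumed byte value for the accented e inside the zip entry name.
        int raw = 0x82;
        // windows-1252 maps 0x82 to the single low-9 quotation mark.
        System.out.println("windows-1252: " + decodeByte(raw, "windows-1252"));
        // IBM437 maps 0x82 to the latin small letter e with acute.
        System.out.println("IBM437:       " + decodeByte(raw, "IBM437"));
    }
}
```

The same byte yields two different characters, so only the charset that matches the tool that wrote the archive gives the intended filename.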
What went wrong?
Printing the filenames contained in the zip file to standard output with System.out.println( name ) made me wrongly assume that the filenames were being interpreted correctly: when I used the windows-1252 encoding to open the zip file, the filename was printed to standard output with the diacritic intact: café.pdf. With other character encodings, different symbols appeared instead of the é.
But when I printed the Unicode value of the é character (with the help of another answer), I saw that when the zip file was opened with the windows-1252 encoding, the actual Unicode value was NOT \u00e9 (latin small letter e with acute) but \u201a (single low-9 quotation mark). When I opened the ZipFile with the IBM437 charset, the correct Unicode value DID appear.
Of course, when a String is printed to standard output with a PrintStream, that PrintStream is itself associated with a certain character encoding. According to the PrintStream Javadoc, all characters printed by a PrintStream are converted into bytes using the platform's default character encoding. I am working on Windows XP.
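Inspecting the code points directly avoids depending on how the console renders a character. The following is a minimal sketch of such a check, not the exact code from the answer mentioned above; the class and method names are hypothetical.

```java
class CodePointDump {
    // Render each UTF-16 code unit of the string as "U+XXXX", separated by
    // spaces. Characters outside the Basic Multilingual Plane would show up
    // as two surrogate code units, which is fine for this kind of debugging.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The name as it should be: ends with latin small letter e with acute.
        System.out.println(describe("caf\u00e9"));
        // The name as decoded with the wrong charset: ends with the
        // single low-9 quotation mark.
        System.out.println(describe("caf\u201a"));
    }
}
```

Passing the value of ze.getName() to a helper like this shows what the String really contains, independent of the encoding used by standard output.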
When I created a new PrintStream with an explicitly chosen character encoding, everything made sense: with the zip file opened using the IBM437 character encoding and the output going through the new PrintStream, the é was printed correctly.
There Ain’t No Such Thing As Plain Text.
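The effect of the PrintStream encoding can be sketched by capturing the bytes it produces. This is an illustration under assumptions: the encoding names "IBM437" and "windows-1252" are chosen to match this post, the console is assumed to display bytes using code page 437, and the helper name is hypothetical. It uses the PrintStream(OutputStream, boolean, String) constructor.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class EncodedPrintDemo {
    // Print the text through a PrintStream that uses the named encoding and
    // return the bytes that reached the underlying stream.
    static byte[] printWith(String text, String encoding) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Wrongly decoded character, printed with the windows-1252 encoding.
        byte[] wrong = printWith("\u201a", "windows-1252");
        // Correctly decoded character, printed with the IBM437 encoding.
        byte[] right = printWith("\u00e9", "IBM437");
        System.out.println(String.format("%02X", wrong[0] & 0xFF));
        System.out.println(String.format("%02X", right[0] & 0xFF));
    }
}
```

Both calls produce the same byte, which would explain why the wrongly decoded name still looked right on a console that interprets that byte as é: the decoding error and the output encoding cancelled each other out. For console output, the same constructor can wrap System.out instead of a ByteArrayOutputStream.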