UPDATED WITH SOLUTION, see at bottom
Requirement:
Process a ZIP file in Java SE 6 that contains files with special characters in their names. Because the ZIP producer's encoding is apparently not UTF-8, special characters come out altered. I would therefore like to convert them back to their proper form.
Issue:
The ZIP contains a file called abcüabc.txt .
The entry is read via java.util.zip.ZipEntry, and when I print the individual characters of its name, ü shows up as two characters: u followed by ¨
Question:
So I would like to know how I can replace that u¨ with ü, or maybe with ue.
What I already tried, without success:
name.replaceAll("u\\¨", "ue");
or
name.replaceAll("ü", "ue");
Original Source Code (not working):
InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt
    System.out.println("pos 3: "+name.charAt(3));
    System.out.println("pos 4: "+name.charAt(4));
    System.out.println("is equal to ¨: "+Character.toString(name.charAt(4)).equals("¨"));
}
Output:
pos 3: u
pos 4:¨
is equal to ¨: false
Notes on my environment:
Zip produced under Mac OS X 10.6.8
Java SE 6: Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)
SOLUTION
Apparently the ZIP producer (in my case Mac OS X) converts special characters into a decomposed form, so ü is stored as u followed by ¨.
While extracting the file names from the ZIP, we want to convert them from the decomposed back to the composed form, so we only need to add a normalization step to the source code above:
import java.text.Normalizer;
import java.text.Normalizer.Form;
InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt
    System.out.println("pos 3: "+name.charAt(3));
    System.out.println("pos 4: "+name.charAt(4));
    System.out.println("contains ü: "+name.contains("ü"));
    // recompose decomposed sequences (u + U+0308) into single characters (ü)
    name = Normalizer.normalize(name, Form.NFC);
    System.out.println("contains ü: "+name.contains("ü"));
}
Output:
pos 3: u
pos 4:¨
contains ü: false
contains ü: true
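Since the question also asked about ue: once the name is in composed form, a plain replace works. The earlier attempts presumably failed because the literal "ü" in the source was the single composed character and the "¨" was U+00A8, neither of which occurs in the decomposed name. The following is only a minimal sketch; UmlautTransliterator is a hypothetical helper class, not part of the code above, and it assumes that only ä, ö and ü need to be mapped:

```java
import java.text.Normalizer;

final class UmlautTransliterator {
    // Hypothetical helper: compose the name first (NFC), then map the
    // composed umlauts to ASCII digraphs.
    static String transliterate(String name) {
        String composed = Normalizer.normalize(name, Normalizer.Form.NFC);
        return composed
                .replace("\u00E4", "ae")  // U+00E4, a with diaeresis
                .replace("\u00F6", "oe")  // U+00F6, o with diaeresis
                .replace("\u00FC", "ue"); // U+00FC, u with diaeresis
    }

    public static void main(String[] args) {
        // "u" followed by U+0308, as delivered by the ZIP entry
        System.out.println(transliterate("abcu\u0308abc.txt")); // prints abcueabc.txt
    }
}
```

String.replace is used instead of replaceAll because no regular expression is needed here.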
That’s not a ¨ (U+00A8 DIAERESIS), but U+0308 COMBINING DIAERESIS. The character is split this way because Mac OS stores file names in Normalization Form D, which decomposes characters like this.
You can compose it back like so:
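A minimal sketch of that composition step, using java.text.Normalizer (available since Java SE 6). The class name ZipNameComposer and the sample string are assumptions made for illustration:

```java
import java.text.Normalizer;

final class ZipNameComposer {
    // Recompose decomposed sequences (NFD) into precomposed characters (NFC).
    static String compose(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "abcu\u0308abc.txt"; // 'u' followed by U+0308
        String composed = compose(decomposed);
        System.out.println(decomposed.length()); // prints 12
        System.out.println(composed.length());   // prints 11
        System.out.println(composed.equals("abc\u00FCabc.txt")); // prints true
    }
}
```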
More about normalization forms
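The difference between the two diaereses is whether they modify the preceding base character: U+0308 combines with it (u followed by U+0308 renders as ü), while U+00A8 is a standalone spacing character that leaves it unchanged.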
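The distinction can also be checked programmatically. The sketch below (DiaeresisDemo is a hypothetical class name) uses Character.getType to test for a non-spacing mark and shows that NFC composes only the combining variant:

```java
import java.text.Normalizer;

final class DiaeresisDemo {
    // True for non-spacing combining marks such as U+0308.
    static boolean isCombiningMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(isCombiningMark('\u0308')); // prints true
        System.out.println(isCombiningMark('\u00A8')); // prints false
        // u + U+0308 composes to the single character U+00FC
        System.out.println(nfc("u\u0308").equals("\u00FC")); // prints true
        // u + U+00A8 stays two separate characters under NFC
        System.out.println(nfc("u\u00A8").length()); // prints 2
    }
}
```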