I am looking for a way to compare two Java strings that are lexicographically equivalent but not identical at the byte level.
More precisely, take the file name “baaaé.png”. At the byte level, it can be represented in two different ways:
[98, 97, 97, 97, -61, -87, 46, 112, 110, 103] –> the “é” is encoded with 2 bytes
[98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103] –> the “é” is encoded with 3 bytes
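import java.nio.charset.StandardCharsets;

// Precomposed form: "é" is the single code point U+00E9 (2 bytes in UTF-8)
byte[] ch = {98, 97, 97, 97, -61, -87, 46, 112, 110, 103};
// Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT (1 + 2 bytes)
byte[] ff = {98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103};
// StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
String st = new String(ch, StandardCharsets.UTF_8);
String st2 = new String(ff, StandardCharsets.UTF_8);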
System.out.println(st);
System.out.println(st2);
System.out.println(st.equals(st2));
This generates the following output:
baaaé.png
baaaé.png
false
Is there a way to do the comparison so that the equals method returns true?
You can use the Collator class with an applicable strength to treat differences such as alternative representations of an accented character as insignificant. This will allow you to compare the strings successfully.
In this case, a collator for the US locale with TERTIARY strength is enough to make the two strings compare as equal.
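A minimal sketch of the Collator approach, using the two byte arrays from the question. The class name `CollatorCompare` and the helper `collatorEquals` are hypothetical names chosen for illustration. The explicit `setDecomposition(Collator.CANONICAL_DECOMPOSITION)` call is an addition not mentioned above; it is included on the assumption that you do not want to rely on the collator's default decomposition mode.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

class CollatorCompare {
    // Hypothetical helper: true when the collator considers the strings equal.
    static boolean collatorEquals(String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.TERTIARY);
        // Assumption: request canonical decomposition explicitly so that
        // precomposed and decomposed characters are compared in the same form.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        byte[] ch = {98, 97, 97, 97, -61, -87, 46, 112, 110, 103};
        byte[] ff = {98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103};
        String st = new String(ch, StandardCharsets.UTF_8);
        String st2 = new String(ff, StandardCharsets.UTF_8);

        System.out.println(st.equals(st2));          // false
        System.out.println(collatorEquals(st, st2)); // true
    }
}
```

Note that a collator compares according to locale rules, so it is best suited to user-facing comparison and sorting rather than to exact identity checks.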
You can also use Java’s Normalizer class to convert between different Unicode normalization forms. This will transform your strings, but once both are in the same form they will be identical, allowing you to use standard string tools to do the comparison.
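As a rough sketch of the Normalizer approach: both strings are converted to the same normalization form and then compared with the ordinary `equals` method. The class name `NormalizedCompare` and the helper `equalsNormalized` are hypothetical, and the choice of NFC (composed form) is an assumption; NFD would work equally well as long as both strings use the same form.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizedCompare {
    // Hypothetical helper: normalize both strings to NFC, then compare exactly.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        byte[] ch = {98, 97, 97, 97, -61, -87, 46, 112, 110, 103};
        byte[] ff = {98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103};
        String st = new String(ch, StandardCharsets.UTF_8);
        String st2 = new String(ff, StandardCharsets.UTF_8);

        System.out.println(st.equals(st2));            // false
        System.out.println(equalsNormalized(st, st2)); // true
    }
}
```

If the strings are stored or used as keys, it may be simpler to normalize them once on input so that later comparisons need no special handling.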
Finally, you might want to take a look at the ICU (International Components for Unicode) project, which provides many tools for working with Unicode strings in a variety of ways.