I have Unicode strings that I want to compare under the following requirement.
Confusable [1] characters should be considered the same character,
for example: T (LATIN CAPITAL LETTER T, U+0054) should be == Τ (GREEK CAPITAL LETTER TAU, U+03A4), etc.
[1] Example: http://unicode.org/cldr/utility/confusables.jsp?a=TESTt&r=None
http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt
I plan to write the code from the file above, but if a free library already does this I would prefer to use it.
My idea is for the code to create a temporary ustring in which every confusable character is replaced with the corresponding Latin character.
In the real program I will be testing 10x5000x10000 strings, each containing one word.
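A minimal sketch of the replacement idea described above, to make it concrete. It assumes the input is already decoded to UTF-32 (std::u32string rather than Glib::ustring), and the table `kToLatin` is a hypothetical, hand-picked handful of entries; a real table would have to be generated from the confusables file. The names `fold_confusables` and `same_when_folded` are made up for this illustration.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, hand-picked subset of the confusables data: each look-alike
// code point maps to the Latin letter it resembles. Not a complete table.
static const std::unordered_map<char32_t, char32_t> kToLatin = {
    {U'\u0391', U'A'},  // GREEK CAPITAL LETTER ALPHA
    {U'\u0395', U'E'},  // GREEK CAPITAL LETTER EPSILON
    {U'\u03A4', U'T'},  // GREEK CAPITAL LETTER TAU
    {U'\u0410', U'A'},  // CYRILLIC CAPITAL LETTER A
    {U'\u0415', U'E'},  // CYRILLIC CAPITAL LETTER IE
    {U'\u0422', U'T'},  // CYRILLIC CAPITAL LETTER TE
};

// Build the temporary string: every code point found in the table is
// replaced by its Latin counterpart, everything else is copied unchanged.
std::u32string fold_confusables(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        auto it = kToLatin.find(c);
        out.push_back(it == kToLatin.end() ? c : it->second);
    }
    return out;
}

// Two strings compare equal if their folded forms are identical.
bool same_when_folded(const std::u32string& a, const std::u32string& b) {
    return fold_confusables(a) == fold_confusables(b);
}
```

With this table, `same_when_folded(U"TEST", U"T\u0395ST")` is true. The sketch only handles one-to-one replacements; the confusables data also contains entries where one character maps to a sequence of several characters, which a single-character table like this cannot express.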
Test program:
std::locale::global(std::locale(""));
std::cout.imbue(std::locale());
Glib::ustring s1,s2;
s1="TEST";
s2="TΕST";
// normalize() returns the normalized string; it does not modify the string in place
s1=s1.normalize(Glib::NORMALIZE_NFKD);
s2=s2.normalize(Glib::NORMALIZE_NFKD);
std::cout<<"1->true, 0->false (s1==s2) => "<<(s1==s2)<<"\n";
Test program output:
1->true, 0->false (s1==s2) => 0
Ubuntu locale command Output:
Ubuntu 12.04 64 bit>$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Thank you for your time!
As user1675224 says, you should be using ICU rather than attempting to roll your own algorithm.
For a one-off comparison of two strings, you can use uspoof_areConfusable.
If you're comparing large numbers of strings against each other, you should convert them to their skeletons using uspoof_getSkeleton, and put those in a set or hash set.