I have Unicode string and I want to compare with the following requirements. Confusable

Asked: June 11, 2026 (2026-06-11T14:28:03+00:00)

I have Unicode strings that I want to compare under the following requirement.

Confusable characters [1] should be considered the same character.
Example: T (LATIN CAPITAL LETTER T, U+0054) should compare equal to Τ (GREEK CAPITAL LETTER TAU, U+03A4), etc.

[1] Example: http://unicode.org/cldr/utility/confusables.jsp?a=TESTt&r=None

http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt

I plan to use the file above to write the code, but if a free library already does this, I would prefer to use it.

My idea is to create a temporary ustring in which every confusable character is replaced with the corresponding Latin character.
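
The replace-then-compare idea described above could be sketched roughly as follows. This is a minimal illustration in plain C++ using std::u32string rather than Glib::ustring. The names kToLatin, fold_confusables, and confusable_equal are hypothetical, and the four-entry table is a hand-picked subset for demonstration, not the contents of the confusables file. In the real Unicode data a single code point can also map to a sequence of several characters, which this one-to-one sketch does not handle.

```cpp
#include <string>
#include <unordered_map>

// Illustrative subset only (an assumption for this sketch): a real table
// would be generated from the Unicode confusables data, not typed by hand.
static const std::unordered_map<char32_t, char32_t> kToLatin = {
    {U'\u0391', U'A'},  // GREEK CAPITAL LETTER ALPHA
    {U'\u0395', U'E'},  // GREEK CAPITAL LETTER EPSILON
    {U'\u03A4', U'T'},  // GREEK CAPITAL LETTER TAU
    {U'\u0405', U'S'},  // CYRILLIC CAPITAL LETTER DZE
};

// Return a copy of `in` with every mapped character replaced by its
// Latin look-alike; unmapped characters are copied unchanged.
std::u32string fold_confusables(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        auto it = kToLatin.find(c);
        out.push_back(it == kToLatin.end() ? c : it->second);
    }
    return out;
}

// Two strings are treated as equal if their folded forms are identical.
bool confusable_equal(const std::u32string& a, const std::u32string& b) {
    return fold_confusables(a) == fold_confusables(b);
}
```

With this table, confusable_equal(U"TEST", U"T\u0395ST") holds, while strings that differ in an unmapped character still compare unequal.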

In the real program I will be testing 10x5000x10000 strings containing one word each.

Test program:

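 std::locale::global(std::locale(""));
 std::cout.imbue(std::locale());

 Glib::ustring s1, s2;
 s1 = "TEST";
 s2 = "TΕST";  // looks identical, but contains a non-Latin look-alike letter

 // normalize() returns a new string rather than modifying in place
 s1 = s1.normalize(Glib::NORMALIZE_NFKD);
 s2 = s2.normalize(Glib::NORMALIZE_NFKD);

 std::cout << "1->true, 0->false  (s1==s2) =>  " << (s1 == s2) << "\n";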

Test program output:

1->true, 0->false  (s1==s2) =>  0

Ubuntu locale command Output:

Ubuntu 12.04 64 bit>$ locale  
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Thank you for your time!


1 Answer

Editorial Team · Answer 1 · 2026-06-11T14:28:05+00:00

As user1675224 says, you should use ICU rather than rolling your own algorithm.

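For example, to use uspoof_areConfusable (here its UTF-8 variant, uspoof_areConfusableUTF8, since Glib::ustring stores UTF-8):

#include <unicode/uspoof.h>

UErrorCode status = U_ZERO_ERROR;
USpoofChecker *sc = uspoof_open(&status);
// The UTF-8 variant takes byte lengths, so pass bytes() rather than length().
int32_t result = uspoof_areConfusableUTF8(sc, s1.data(), s1.bytes(), s2.data(), s2.bytes(), &status);
uspoof_close(sc);
// If U_SUCCESS(status), a nonzero result means the strings are confusable.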

If you’re comparing large numbers of strings against each other, you should convert them to their skeletons using uspoof_getSkeleton, and put that in a set or hash set.
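
As a rough sketch of the set/hash-set approach, the words can be bucketed by skeleton so that each word costs one skeleton computation instead of one comparison per pair. The SkeletonFn callback and group_by_skeleton below are hypothetical names for illustration. In a real program the callback would presumably wrap ICU's skeleton function, but any function that maps confusable words to the same string fits here.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical callback type: maps a UTF-8 word to its skeleton.
// The actual skeleton computation is assumed to be supplied by the caller.
using SkeletonFn = std::function<std::string(const std::string&)>;

// Bucket words by skeleton so that mutually confusable words share one key.
std::unordered_map<std::string, std::vector<std::string>>
group_by_skeleton(const std::vector<std::string>& words,
                  const SkeletonFn& skeleton) {
    assert(skeleton && "a skeleton callback must be provided");
    std::unordered_map<std::string, std::vector<std::string>> groups;
    for (const std::string& w : words) {
        groups[skeleton(w)].push_back(w);
    }
    return groups;
}
```

Any bucket holding more than one word is a group of words that are confusable with each other, and checking a new word against the existing ones is a single hash lookup on its skeleton.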
