The Unicode Common Locale Data Repository (CLDR) has a wealth of information regarding the

Question

Asked: June 6, 2026

The Unicode Common Locale Data Repository (CLDR) has a wealth of information regarding the relationship between languages and characters. For example, you can determine which characters are used in a particular language by looking at the misc.exemplarCharacters chart. The raw data for these charts is stored as XML files, and the exemplar characters are stored as regular expressions according to the Unicode Regular Expressions standard (UTS18).

Here are a few examples of what UTS18 regular expressions look like:

1. [a à b c ç d e é è f g h i í ï j k l ŀ m n o ó ò p q r s t u ú ü v w x y z]
2. [অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ ং \u0981 ঃ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড {ড\u09BC}ড় ঢ {ঢ\u09BC}ঢ় ণ ত থ দ ধ ন প ফ ব ভ ম য {য\u09BC} ৰ ল ৱ শ ষ স হ া ি ী \u09C1 \u09C2 \u09C3 ে ৈ ো ৌ \u09CD]
3. [a á b ɓ c d ɗ e é ɛ {ɛ\u0301} f g i í j k l m n {ny} ŋ o ó ɔ {ɔ\u0301} p r s t u ú ū w y]

I’m using PHP and SimpleXML to parse the XML data and isolate these regex strings. Now, I would like to match individual multi-byte characters against these regular expressions. I’m currently using the mb_ereg_match() function, which yields one or more of the following warnings (depending on the regex):

mbregex compile err: premature end of char-class in ...
mbregex compile err: empty range in char class in ...
mbregex compile err: empty char-class in ...

Any ideas as to why this isn’t working?

1 Answer

Editorial Team · Answer 1 · 2026-06-06T07:09:08+00:00

As suggested by Sergey, I added the following lines before calling the mb_ereg_match() function:

mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

This addition eliminated two of the warnings listed above. I was only left with the following warning:

mbregex compile err: empty char-class in ...
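To put those two calls in context, here is a minimal sketch of a match against the first exemplar set from the question. The helper name matchesExemplar is my own and not part of mbstring, and the sketch assumes the mbstring extension is available with multibyte regex support.

```php
<?php
// Set both encodings to UTF-8 before the pattern is compiled, so the
// multibyte characters inside [...] are read as whole characters.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

/**
 * Hypothetical helper (not part of mbstring): tests one character
 * against an exemplar set. mb_ereg_match() only anchors at the start
 * of the subject string, so pass exactly one character.
 */
function matchesExemplar(string $char, string $set): bool
{
    return mb_ereg_match($set, $char);
}

// First example set from the question.
$set = '[a à b c ç d e é è f g h i í ï j k l ŀ m n o ó ò p q r s t u ú ü v w x y z]';

var_dump(matchesExemplar('ç', $set)); // character listed in the set
var_dump(matchesExemplar('ß', $set)); // character not listed in the set
```

As far as I can tell, mbregex treats the separating spaces as literal members of the class and does not read items such as {ny} as a single unit, so this is only an approximation of the UTS18 set semantics.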

After some additional debugging, I discovered that a handful of the CLDR XML files do in fact contain empty character classes. For example, in kn.xml we have the following line:

<exemplarCharacters type="auxiliary">[]</exemplarCharacters>

I believe these entries are erroneous: the expected behavior would be to omit the element altogether, which is what most CLDR files do.

Thus, I was able to eliminate this last warning by skipping any exemplar string that is an empty set ([]).
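A rough sketch of that filtering step with SimpleXML follows. The inline XML is a cut-down stand-in for a CLDR locale file rather than real CLDR data, loadExemplarSets is a name made up for illustration, and the "standard" key for entries without a type attribute is my own labeling choice.

```php
<?php
/**
 * Hypothetical helper: collects the non-empty exemplar sets from LDML
 * markup, keyed by the "type" attribute. Entries without a type
 * attribute are stored under "standard" (an assumption of this sketch).
 */
function loadExemplarSets(string $xml): array
{
    $sets = [];
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return $sets; // unparsable input: nothing to collect
    }
    foreach ($doc->xpath('//exemplarCharacters') as $node) {
        $set = trim((string) $node);
        // Skip blank values and sets with nothing between the
        // brackets, e.g. "[]" or "[ ]".
        if ($set === '' || preg_match('/^\[\s*\]$/u', $set) === 1) {
            continue;
        }
        $type = isset($node['type']) ? (string) $node['type'] : 'standard';
        $sets[$type] = $set;
    }
    return $sets;
}

// Cut-down stand-in for a CLDR locale file, not real CLDR data.
$xml = <<<XML
<ldml>
  <characters>
    <exemplarCharacters>[a b c]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[]</exemplarCharacters>
  </characters>
</ldml>
XML;

$sets = loadExemplarSets($xml);
print_r(array_keys($sets)); // only the non-empty set's key remains
```

Only the strings that survive this filter are then passed to mb_ereg_match(), so the empty char-class warning is never triggered.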

Hope this helps someone else!
