I’m looking for a way to determine the character set associated with a given language code. For example, if I passed in “en” for English, it might return the Unicode characters for a-zA-Z. An API on Apple’s platforms would be ideal, but I’d settle for an explanation of whether such a thing is possible in Unicode, so that I could code it up myself. Maybe something like character classes.
There are sets of characters used in different languages in the CLDR database. Its format is XML-based LDML, but you might find alternate derived formats or APIs for it, and you might find ICU applicable.
The sets are specified in characters elements (as exemplarCharacters child elements), and you can find summary charts of existing content, though in a rather awkward format (a very wide table).
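As a rough sketch of how such data could be read, the Python snippet below pulls the exemplarCharacters elements out of an LDML fragment and expands them into sets. The fragment is hand-written for illustration and is not copied from CLDR, the function names are made up for this example, and the parser only understands a small subset of the UnicodeSet notation (single characters, ranges like a-z, and {multi-character} strings). For real work, ICU's own UnicodeSet implementation would be the safer choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written LDML fragment for illustration only;
# it is NOT the actual CLDR data for any locale.
SAMPLE_LDML = """<ldml>
  <characters>
    <exemplarCharacters>[a-e x y z {ch}]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[ë ō]</exemplarCharacters>
  </characters>
</ldml>"""


def parse_simple_unicode_set(pattern):
    """Expand a simple UnicodeSet pattern such as "[a-c {ch} é]".

    Supports single characters, ranges (a-z) and {multi-character} strings.
    Escapes, properties and nested sets are deliberately not handled.
    """
    body = pattern.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a pattern in square brackets")
    body = body[1:-1]
    items = set()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            # A brace-delimited string counts as one item.
            end = body.index("}", i)
            items.add(body[i + 1:end])
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            # Inclusive range of code points, e.g. a-z.
            for code in range(ord(ch), ord(body[i + 2]) + 1):
                items.add(chr(code))
            i += 3
        else:
            items.add(ch)
            i += 1
    return items


def exemplar_sets(ldml_text):
    """Map each exemplar type ("main" when no type attribute) to its set."""
    root = ET.fromstring(ldml_text)
    result = {}
    for element in root.iter("exemplarCharacters"):
        kind = element.get("type", "main")
        result[kind] = parse_simple_unicode_set(element.text)
    return result


if __name__ == "__main__":
    for kind, chars in exemplar_sets(SAMPLE_LDML).items():
        print(kind, sorted(chars))
```

The element without a type attribute is treated as the main set here; other types, such as auxiliary, are keyed by their attribute value.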
Perhaps the best way to quickly check whether the CLDR data on characters is useful for your purposes is to look at the data for some locales. The root locale data contains (as part of a large table) the following information about the English locale:
I think this demonstrates that the sets match actual usage only loosely: the main set is rather narrow, and the auxiliary set is rather broad. For example, the main set (of letters) for English does not contain even “ë” (think about Brontë), whereas the auxiliary set contains, in addition to letters commonly used in English, letters that only occur in truly foreign words, like “ō”.
There is a rather vague description of what these sets are for. Different use cases would require different approaches. For example, it would be natural to use the union of these sets to decide whether a font is suitable for texts in a given language (i.e., whether it contains all of the characters, in acceptable shape). But this would in practice exclude fonts that are just fine but lack a glyph for a few very rarely used characters. Similarly, if you used this information to decide which character encodings can be used, you would end up with the conclusion that only Unicode encodings are acceptable for English.
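The encoding point can be illustrated with a short Python sketch. The two sets below are small stand-ins chosen for the example (the basic Latin letters as "main", plus the two letters “ë” and “ō” as "auxiliary"); they are assumptions and not the actual CLDR sets, and the helper name is made up.

```python
def unencodable(chars, encoding):
    """Return the characters in `chars` that `encoding` cannot represent."""
    missing = set()
    for ch in chars:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing


# Illustrative stand-ins only, NOT the real CLDR exemplar sets.
main = set("abcdefghijklmnopqrstuvwxyz")
auxiliary = {"ë", "ō"}

if __name__ == "__main__":
    for encoding in ("ascii", "latin-1", "utf-8"):
        print(
            encoding,
            "main only:", sorted(unencodable(main, encoding)),
            "main + auxiliary:", sorted(unencodable(main | auxiliary, encoding)),
        )
```

With these stand-in sets, the main set alone fits in every encoding tried, while the union is fully covered only by UTF-8, because “ō” has no code point in ISO-8859-1. A stricter or looser rule (for example, tolerating a few missing auxiliary characters) is a design decision that depends on the use case.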
To conclude, the CLDR data on characters is a useful compilation but should be used with caution and care.