Good afternoon all, I am building a function that takes a string as input,


Asked: May 28, 2026 (2026-05-28T19:06:03+00:00)

Good afternoon all,

I am building a function that takes a string as input, removes any unnatural combining diacritic characters from the string, and returns the modified string as output.

An unnatural combining diacritic sequence is a sequence of Unicode code points that, when combined, produces output that does not belong to any language under the sun (ancient scripts/languages are considered natural languages).

For example, given the String input:

   "aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa" //code points 0061 0061 0061 0300 0301 0302 0303 0304 0305 0306 0307 0308 0309 030a 030b 030c 030d 030e 030f 0310 0311 0312 0313 0314 0315 0316 0317 0318 0319 031a 031b 031c 031d 031e 031f 0320 0321 0322 0323 0324 0325 0326 0327 0328 0329 032a 032b 032c 032d 032e 032f 032f 0330 0331 0332 0333 0334 0335 0336 0337 0338 0339 033a 033b 033c 033d 033e 033f 0340 0341 0342 0343 0344 0345 0346 0347 0348 0349 034a 034b 034c 034d 034e 0360 0361 0061 0061

, the function should return the result aaàaa (code points 0061 0061 0061 0300 0061 0061),

Since à́ (code points 0061 0300 0301) isn’t a character in any natural language. In other words:

  assert F("aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa").equals("aaàaa");

Or for source code saved using latin charsets:

 assert F("\u0061\u0061\u0061\u0300\u0301\u0302\u0303\u0304\u0305\u0306\u0307\u0308\u0309\u030a\u030b\u030c\u030d\u030e\u030f\u0310\u0311\u0312\u0313\u0314\u0315\u0316\u0317\u0318\u0319\u031a\u031b\u031c\u031d\u031e\u031f\u0320\u0321\u0322\u0323\u0324\u0325\u0326\u0327\u0328\u0329\u032a\u032b\u032c\u032d\u032e\u032f\u032f\u0330\u0331\u0332\u0333\u0334\u0335\u0336\u0337\u0338\u0339\u033a\u033b\u033c\u033d\u033e\u033f\u0340\u0341\u0342\u0343\u0344\u0345\u0346\u0347\u0348\u0349\u034a\u034b\u034c\u034d\u034e\u0360\u0361\u0061\u0061").equals("\u0061\u0061\u0061\u0300\u0061\u0061");

How do we go about determining whether a sequence of characters, or a sequence of Unicode code points, is natural?

Or rather, is there a limit to how many combining diacritic characters a character belonging to a natural language will use?


1 Answer

Editorial Team · Answer 1 · 2026-05-28T19:06:04+00:00

From the Unicode Standard, version 6.0:

All combining characters can be applied to any base character and can, in principle, be used
with any script. As with other characters, the allocation of a combining character to one
block or another identifies only its primary usage; it is not intended to define or limit the
range of characters to which it may be applied. In the Unicode Standard, all sequences of
character codes are permitted.

This does not create an obligation on implementations to support all possible combinations
equally well. Thus, while application of an Arabic annotation mark to a Han character
or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering
or to make much sense.

There is unlikely to be enough information in the Unicode data to do this algorithmically.

There are some rules for canonical composition/decomposition that you could use to determine if a sequence is a “natural” sequence. For example, U+0065 U+0301 maps to the precomposed character U+00E9 (é). But this won’t work for every case: many legitimate sequences have no precomposed form.
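As a rough illustration of the composition idea, here is a minimal Java sketch. It assumes a combining mark counts as "natural" only if the base plus the marks kept so far still normalizes (NFC) to a single precomposed code point. The class and method names (`CombiningFilter`, `stripUncomposableMarks`) are made up for this example, and the heuristic will wrongly drop legitimate marks in scripts that have no precomposed forms, so treat it as a starting point rather than a solution.

```java
import java.text.Normalizer;

final class CombiningFilter {

    // True for the general categories Mn, Me and Mc (combining marks).
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Hypothetical heuristic: keep a mark only if base + kept marks + this mark
    // composes to exactly one code point under NFC. Original code points are
    // preserved in the output; NFC is used only as the test.
    static String stripUncomposableMarks(String input) {
        StringBuilder out = new StringBuilder();
        StringBuilder cluster = new StringBuilder(); // current base + kept marks
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isMark(cp)) {
                // A new base character starts a new cluster.
                out.append(cluster);
                cluster.setLength(0);
                cluster.appendCodePoint(cp);
            } else if (cluster.length() > 0) {
                String candidate =
                    new StringBuilder(cluster).appendCodePoint(cp).toString();
                String nfc = Normalizer.normalize(candidate, Normalizer.Form.NFC);
                if (nfc.codePointCount(0, nfc.length()) == 1) {
                    cluster.appendCodePoint(cp);
                }
                // Otherwise the mark is silently dropped.
            }
            // A mark with no preceding base is dropped as well.
        }
        out.append(cluster);
        return out.toString();
    }
}
```

With this sketch, `stripUncomposableMarks("aaa\u0300\u0301aa")` keeps U+0300 (because a + U+0300 composes to U+00E0) and drops U+0301, while a Vietnamese sequence such as `"a\u0302\u0301"` is kept whole because it composes to U+1EA5.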

Beyond that, I’m not sure what you could do without using some form of validation table built by experts or generated from some corpus of language data.
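To make the validation-table idea concrete, here is a small Java sketch that derives the table from a corpus. Everything in it is an assumption for illustration: the class name `CorpusMarkFilter`, the choice to compare in NFD, and the rule that a mark is kept only if the resulting base-plus-marks prefix was seen somewhere in the corpus. A real table would need far more data and expert review.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CorpusMarkFilter {
    // Every base+marks prefix (in NFD) observed in the corpus.
    private final Set<String> allowed = new HashSet<>();

    CorpusMarkFilter(List<String> corpus) {
        for (String text : corpus) {
            for (String cluster : clusters(nfd(text))) {
                int end = 0;
                while (end < cluster.length()) {
                    end += Character.charCount(cluster.codePointAt(end));
                    allowed.add(cluster.substring(0, end));
                }
            }
        }
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Splits text into clusters: one base code point plus the marks after it.
    static List<String> clusters(String s) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (!isMark(cp) && current.length() > 0) {
                result.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    // Keeps a mark only when the cluster built so far plus that mark
    // appears in the table. The result is returned in NFD.
    String filter(String input) {
        StringBuilder out = new StringBuilder();
        for (String cluster : clusters(nfd(input))) {
            int first = cluster.codePointAt(0);
            StringBuilder kept = new StringBuilder().appendCodePoint(first);
            for (int i = Character.charCount(first); i < cluster.length(); ) {
                int cp = cluster.codePointAt(i);
                i += Character.charCount(cp);
                String candidate =
                    new StringBuilder(kept).appendCodePoint(cp).toString();
                if (allowed.contains(candidate)) {
                    kept.appendCodePoint(cp);
                }
            }
            out.append(kept);
        }
        return out.toString();
    }
}
```

For example, a filter built from a corpus containing "voilà" and "café" would reduce `"aaa\u0300\u0301\u0302aa"` to `"aaa\u0300aa"`, since a + U+0300 occurs in the corpus but a + U+0300 + U+0301 does not. The quality of the result depends entirely on how representative the corpus is.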
