Good afternoon all,
I am building a function that takes a string as input, removes any unnatural combining diacritic characters from it, and returns the modified string as output.
An unnatural combining diacritic sequence is a sequence of Unicode code points that, when combined, produces output that does not belong to any language under the sun (ancient scripts/languages are considered natural languages).
For example, given the String input:
"aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa" //code points 0061 0061 0061 0300 0301 0302 0303 0304 0305 0306 0307 0308 0309 030a 030b 030c 030d 030e 030f 0310 0311 0312 0313 0314 0315 0316 0317 0318 0319 031a 031b 031c 031d 031e 031f 0320 0321 0322 0323 0324 0325 0326 0327 0328 0329 032a 032b 032c 032d 032e 032f 032f 0330 0331 0332 0333 0334 0335 0336 0337 0338 0339 033a 033b 033c 033d 033e 033f 0340 0341 0342 0343 0344 0345 0346 0347 0348 0349 034a 034b 034c 034d 034e 0360 0361 0061 0061
, the function should return the result aaàaa (code points 0061 0061 0061 0300 0061 0061),
Since à́ (code points 0061 0300 0301) isn’t a character in any natural language. In other words:
assert F("aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa").equals("aaàaa");
Or, for source code saved in a Latin charset:
assert F("\u0061\u0061\u0061\u0300\u0301\u0302\u0303\u0304\u0305\u0306\u0307\u0308\u0309\u030a\u030b\u030c\u030d\u030e\u030f\u0310\u0311\u0312\u0313\u0314\u0315\u0316\u0317\u0318\u0319\u031a\u031b\u031c\u031d\u031e\u031f\u0320\u0321\u0322\u0323\u0324\u0325\u0326\u0327\u0328\u0329\u032a\u032b\u032c\u032d\u032e\u032f\u032f\u0330\u0331\u0332\u0333\u0334\u0335\u0336\u0337\u0338\u0339\u033a\u033b\u033c\u033d\u033e\u033f\u0340\u0341\u0342\u0343\u0344\u0345\u0346\u0347\u0348\u0349\u034a\u034b\u034c\u034d\u034e\u0360\u0361\u0061\u0061").equals("\u0061\u0061\u0061\u0300\u0061\u0061");
How do we go about determining whether a sequence of characters (or a sequence of Unicode code points) is natural?
Or rather, is there a limit to how many combining diacritic characters a character belonging to a natural language will use?
Unicode 6.0:
There is unlikely to be enough information in the Unicode data to do this algorithmically.
There are some rules for canonical composition/decomposition that you could use to determine whether a sequence is a “natural” sequence: for example, U+0065 U+0301 maps to the precomposed U+00E9 (é). But this won’t work for every case, because many sequences that real languages use have no precomposed form.
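As a rough illustration of the composition idea, here is a minimal sketch using java.text.Normalizer. The class and method names are made up for this example. It assumes that a mark from the Combining Diacritical Marks block (U+0300 to U+036F) is "natural" only if NFC can fold it into a precomposed character, which is exactly the assumption that fails for languages whose letters have no precomposed form.

```java
import java.text.Normalizer;

// Hypothetical helper, not a complete solution: it treats "NFC could compose it"
// as a stand-in for "natural", which is only an approximation.
final class ComposableMarks {

    // True if cp lies in the Combining Diacritical Marks block (U+0300..U+036F).
    static boolean isDiacritic(int cp) {
        return Character.UnicodeBlock.of(cp)
                == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS;
    }

    // Composes with NFC, drops every diacritic that NFC left uncomposed,
    // then decomposes again so the result is in NFD form.
    static String stripUncomposed(String input) {
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder kept = new StringBuilder();
        nfc.codePoints()
           .filter(cp -> !isDiacritic(cp))
           .forEach(kept::appendCodePoint);
        return Normalizer.normalize(kept, Normalizer.Form.NFD);
    }
}
```

With this sketch, stripUncomposed("aaa\u0300\u0301aa") gives "aaa\u0300aa": NFC folds a + U+0300 into U+00E0, U+0301 has nothing to compose with and is dropped. A sequence such as e + U+0323 + U+0301 would lose its acute accent because no precomposed character exists for it, even though it may be perfectly legitimate text.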
Beyond that, I’m not sure what you could do without using some form of validation table built by experts or generated from some corpus of language data.
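To make the validation-table idea concrete, here is a small sketch. Everything in it is hypothetical: the ALLOWED set stands in for a table that experts or a corpus would have to supply, and its four entries are only placeholders. The sketch assumes the table stores sequences in NFD form and also contains every prefix of a longer allowed sequence, since marks are accepted one at a time.

```java
import java.text.Normalizer;
import java.util.Set;

// Hypothetical allow-list filter; the table contents are placeholders only.
final class MarkAllowList {

    // Allowed base+mark sequences in NFD form (illustrative, far from complete).
    static final Set<String> ALLOWED =
            Set.of("a\u0300", "e\u0301", "e\u0323", "e\u0323\u0301");

    // True if cp lies in the Combining Diacritical Marks block (U+0300..U+036F).
    static boolean isDiacritic(int cp) {
        return Character.UnicodeBlock.of(cp)
                == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS;
    }

    // Walks the NFD form and keeps a diacritic only if the cluster built so far,
    // extended by that diacritic, appears in the table.
    static String filter(String input) {
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        StringBuilder cluster = new StringBuilder();
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            i += Character.charCount(cp);
            if (!isDiacritic(cp)) {
                // A new base starts: flush the previous cluster.
                out.append(cluster);
                cluster.setLength(0);
                cluster.appendCodePoint(cp);
            } else {
                int before = cluster.length();
                cluster.appendCodePoint(cp);
                if (!ALLOWED.contains(cluster.toString())) {
                    cluster.setLength(before); // reject this mark
                }
            }
        }
        return out.append(cluster).toString();
    }
}
```

For instance, filter("aaa\u0300\u0301aa") returns "aaa\u0300aa", while filter("e\u0323\u0301") is returned unchanged because the full sequence is in the table. The quality of the result depends entirely on how complete the table is.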