The Archive Base

Editorial Team
Asked: May 28, 2026 (2026-05-28T19:06:03+00:00)


Good afternoon all,

I am building a function that takes a string as input, removes any unnatural combining diacritic sequences from it, and returns the modified string.

An unnatural combining diacritic sequence is a sequence of Unicode code points that, when combined, produces output that does not belong to any natural language (ancient scripts and languages count as natural).

For example, given the String input:

   "aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa" //code points 0061 0061 0061 0300 0301 0302 0303 0304 0305 0306 0307 0308 0309 030a 030b 030c 030d 030e 030f 0310 0311 0312 0313 0314 0315 0316 0317 0318 0319 031a 031b 031c 031d 031e 031f 0320 0321 0322 0323 0324 0325 0326 0327 0328 0329 032a 032b 032c 032d 032e 032f 032f 0330 0331 0332 0333 0334 0335 0336 0337 0338 0339 033a 033b 033c 033d 033e 033f 0340 0341 0342 0343 0344 0345 0346 0347 0348 0349 034a 034b 034c 034d 034e 0360 0361 0061 0061

the function should return the result aaàaa (code points 0061 0061 0061 0300 0061 0061), since à́ (code points 0061 0300 0301) isn't a character in any natural language. In other words:

  assert F("aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa").equals("aaàaa");

Or, for source code saved in a Latin charset:

 assert F("\u0061\u0061\u0061\u0300\u0301\u0302\u0303\u0304\u0305\u0306\u0307\u0308\u0309\u030a\u030b\u030c\u030d\u030e\u030f\u0310\u0311\u0312\u0313\u0314\u0315\u0316\u0317\u0318\u0319\u031a\u031b\u031c\u031d\u031e\u031f\u0320\u0321\u0322\u0323\u0324\u0325\u0326\u0327\u0328\u0329\u032a\u032b\u032c\u032d\u032e\u032f\u032f\u0330\u0331\u0332\u0333\u0334\u0335\u0336\u0337\u0338\u0339\u033a\u033b\u033c\u033d\u033e\u033f\u0340\u0341\u0342\u0343\u0344\u0345\u0346\u0347\u0348\u0349\u034a\u034b\u034c\u034d\u034e\u0360\u0361\u0061\u0061").equals("\u0061\u0061\u0061\u0300\u0061\u0061");

How do we determine whether a sequence of characters, or a sequence of Unicode code points, is natural?

Or rather, is there a limit to how many combining diacritic characters a character belonging to a natural language will use?

1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 7:06 pm (2026-05-28T19:06:04+00:00)

    From the Unicode Standard, version 6.0:

    All combining characters can be applied to any base character and can, in principle, be used
    with any script. As with other characters, the allocation of a combining character to one
    block or another identifies only its primary usage; it is not intended to define or limit the
    range of characters to which it may be applied. In the Unicode Standard, all sequences of
    character codes are permitted.

    This does not create an obligation on implementations to support all possible combinations
    equally well. Thus, while application of an Arabic annotation mark to a Han character
    or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering
    or to make much sense.

    There is unlikely to be enough information in the Unicode data to do this algorithmically.
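What the Unicode data does expose is each code point's general category, which is enough to find and count combining marks, though not to judge whether a given combination is "natural". The sketch below is only an illustration of that point; the class and method names (`MarkCounter`, `isCombiningMark`, `longestMarkRun`) are hypothetical, and any threshold you might apply to the count would be a heuristic of your own, not something the standard defines.

```java
final class MarkCounter {
    /** True if the code point has general category Mn, Mc, or Me. */
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Length of the longest run of consecutive combining marks in s. */
    static int longestMarkRun(String s) {
        int longest = 0;
        int run = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // A mark extends the current run; anything else resets it.
            run = isCombiningMark(cp) ? run + 1 : 0;
            longest = Math.max(longest, run);
            i += Character.charCount(cp);
        }
        return longest;
    }

    public static void main(String[] args) {
        // e + U+0301 (combining acute): one mark
        System.out.println(longestMarkRun("e\u0301"));                   // prints 1
        // a + three marks, then b + one mark
        System.out.println(longestMarkRun("a\u0300\u0301\u0302b\u0303")); // prints 3
    }
}
```

A count like this can flag suspiciously long stacks, but a short stack can still be meaningless and a longer one can be legitimate, so it does not answer the question on its own.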

    The rules for canonical composition/decomposition offer a partial test: a base-plus-mark sequence that composes to a single precomposed character, such as U+0065 U+0301 mapping to U+00E9 (é), is almost certainly "natural". But this won't work for every case, because many legitimate sequences have no precomposed form.
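As a rough sketch of the composition test, the snippet below uses `java.text.Normalizer` to check whether a base-plus-marks cluster normalizes (NFC) to exactly one code point. The class and method names are hypothetical, and the check is only a positive signal: a `false` result does not prove the sequence is unnatural.

```java
import java.text.Normalizer;

final class ComposeCheck {
    /** True if the cluster normalizes under NFC to exactly one code point. */
    static boolean composesToSingleCodePoint(String cluster) {
        String nfc = Normalizer.normalize(cluster, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length()) == 1;
    }

    public static void main(String[] args) {
        // e + U+0301 composes to U+00E9
        System.out.println(composesToSingleCodePoint("e\u0301"));        // prints true
        // a + U+0300 + U+0301: only the grave composes, the acute is left over
        System.out.println(composesToSingleCodePoint("a\u0300\u0301"));  // prints false
        // U+025B (open e) + U+0301 has no precomposed form, although, as far as
        // I know, it is used in some real orthographies: a false negative
        System.out.println(composesToSingleCodePoint("\u025B\u0301"));   // prints false
    }
}
```

The last case shows why composition alone is not sufficient: it rejects sequences that real writing systems rely on.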

    Beyond that, I’m not sure what you could do without using some form of validation table built by experts or generated from some corpus of language data.
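A table-driven filter along those lines might look like the following minimal sketch. Everything here is an assumption for illustration: the `ALLOWED` set is a tiny hypothetical stand-in for an expert-built or corpus-derived table, the names `DiacriticFilter` and `filter` are made up, and the code assumes Java 9+ for `Set.of`. For each base character it keeps the longest leading run of marks that appears in the table and drops the rest.

```java
import java.util.Set;

final class DiacriticFilter {
    // Hypothetical whitelist of base + mark sequences, written as code points in
    // the order they appear in the input. A real table would be far larger.
    static final Set<String> ALLOWED = Set.of("a\u0300", "e\u0301");

    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Keeps, for each base, the longest leading run of marks found in ALLOWED. */
    static String filter(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int start = i;
            i += Character.charCount(input.codePointAt(i)); // step over the base

            // Find the end of the run of combining marks that follows the base.
            int end = i;
            while (end < input.length() && isMark(input.codePointAt(end))) {
                end += Character.charCount(input.codePointAt(end));
            }

            // The base alone is always kept; extend while a prefix is whitelisted.
            int keep = i;
            int j = i;
            while (j < end) {
                j += Character.charCount(input.codePointAt(j));
                if (ALLOWED.contains(input.substring(start, j))) {
                    keep = j;
                }
            }
            out.append(input, start, keep);
            i = end; // skip whatever marks were not kept
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String result = filter("aaa\u0300\u0301\u0302aa");
        System.out.println(result.equals("aaa\u0300aa")); // prints true
    }
}
```

This sketch deliberately does not normalize its input. Canonical normalization reorders marks by combining class, so a version that normalized first would need table entries stored in that canonical order as well.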


© 2021 The Archive Base. All Rights Reserved