The Archive Base

Editorial Team
Asked: June 11, 2026

We are developing a system to do fuzzy matching on over 50 international languages


We are developing a system to do fuzzy matching on over 50 international languages using the UTF-8, UTF-16, and UTF-32 Unicode encodings. So far, we have been able to use Levenshtein distance to detect misspellings of German words that contain extended (non-ASCII) Unicode characters.

We would like to extend this system to handle Mandarin Chinese ideographs represented in Unicode. How would we perform Levenshtein distance calculation between similar Chinese characters?

1 Answer

    Editorial Team
    Added an answer on June 11, 2026 at 5:46 am

    First, to clarify: a Chinese character is not equivalent to a German or English word. Most of the things you'd consider words (using a semantic or syntactic definition of "word") consist of 1-3 characters. It is straightforward to apply Levenshtein distance to such character sequences by representing them as sequences of Unicode code points (UCS-4/UTF-32; UCS-2 only covers characters in the Basic Multilingual Plane). Because most words are short (especially words of 1 or 2 characters), this may be of limited use, though.
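As a minimal sketch of the code-point approach described above (the function name `levenshtein` is my own choice, and unit costs for insertion, deletion, and substitution are an assumption): Python strings are already sequences of Unicode code points, so no explicit UCS-4 conversion is needed here.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; every edit costs 1."""
    # previous[j] holds the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Two-character words that differ in one character.
    print(levenshtein("北京", "南京"))
    # A three-character word with its last character dropped.
    print(levenshtein("图书馆", "图书"))
```

With words this short, a distance of 1 already means that a third to a half of the word differs, which illustrates why the measure may be of limited use at the word level.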

    However, as your question is specifically about the edit distance between individual characters, I believe a different approach is required, and it may be very difficult indeed.

    For a start, you’d have to represent each character as a sequence of the components / strokes it consists of. There are two problems:

    • Some components themselves consist of even smaller components, so how to break a character down into "atomic" components is not uniquely defined. If you go down to the level of individual strokes, you'd need a characterisation of every single stroke (position within the character, shape, direction, etc.). I don't think anyone has ever done this (I'd be most interested if anyone tells me otherwise).

    • You'd need to put the strokes or components into an order. The obvious candidate is the canonical stroke order of the character, which is described in lexica, and there are even dictionary websites with animated stroke order diagrams. However, the data sources I know (for Japanese) generate these animations as sequences of bitmap graphics; I have never seen human- or machine-readable codes that represent the sequence of strokes (or even the names of individual strokes) in a form that is suitable for edit distance calculation.
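To make the two problems above concrete, here is a rough sketch of what a component-level edit distance could look like if a decomposition table were available. The `COMPONENTS` table is hand-written for illustration only (it is not taken from any standard data source), the component order is simply left-to-right, and the names `levenshtein` and `component_distance` are hypothetical.

```python
# Hand-written, illustrative decompositions; a real system would need a
# complete table and would have to settle on one decomposition and one
# component order per character, which is exactly the hard part.
COMPONENTS = {
    "好": ["女", "子"],
    "妈": ["女", "马"],
    "吗": ["口", "马"],
    "明": ["日", "月"],
}


def levenshtein(a, b):
    """Edit distance between two sequences; every edit costs 1."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def component_distance(char_a, char_b):
    """Edit distance between the component sequences of two characters."""
    return levenshtein(COMPONENTS[char_a], COMPONENTS[char_b])


if __name__ == "__main__":
    for pair in [("好", "妈"), ("妈", "吗"), ("好", "明")]:
        print(pair, component_distance(*pair))
```

Characters that share one of two components come out at distance 1 and unrelated characters at distance 2, but the result depends entirely on how the table breaks each character down.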

    One final thing you could try, though, is to render the character glyphs and calculate the edit distance based on how many pixels (or vectors) need to be changed to turn one character into another. I once did this for Latin characters and character combinations (on a pixel basis) in the context of OCR post-correction, and the results were quite encouraging.
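A toy sketch of the pixel-based idea, under strong assumptions: the 5x7 bitmaps below are drawn by hand rather than rendered from a font, and the distance is a plain count of differing pixels (a Hamming distance), which is not necessarily what was used in the OCR work mentioned above. The names `GLYPHS` and `pixel_distance` are hypothetical.

```python
# Hand-drawn 5x7 bitmaps ('#' = ink, '.' = background). In practice the
# glyphs would be rendered from a font at a fixed size and alignment.
GLYPHS = {
    "口": ["#####",
           "#...#",
           "#...#",
           "#...#",
           "#...#",
           "#...#",
           "#####"],
    "日": ["#####",
           "#...#",
           "#...#",
           "#####",
           "#...#",
           "#...#",
           "#####"],
    "目": ["#####",
           "#...#",
           "#####",
           "#...#",
           "#####",
           "#...#",
           "#####"],
}


def pixel_distance(char_a, char_b):
    """Number of pixels that must be flipped to turn one bitmap into the other."""
    rows_a, rows_b = GLYPHS[char_a], GLYPHS[char_b]
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(rows_a, rows_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


if __name__ == "__main__":
    for pair in [("口", "日"), ("口", "目"), ("日", "目")]:
        print(pair, pixel_distance(*pair))
```

A count like this is sensitive to glyph alignment and font choice, so bitmaps would need to be normalised before distances between characters are compared.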


    A quick answer to larsmans' comment below: there are two related concepts defined by the Unicode Standard (below I refer to version 6.0, chapter 12):

    1. An index based on radicals and stroke counts. Each Han character consists of several components, one of which is the radical. A radical/stroke count index is a character list sorted by radical (i.e. all characters that share the same radical grouped together), and each radical-specific group internally sorted by the number of strokes used in the rest of the character. Unfortunately, even this is not uniquely defined – there are characters whose radical is defined differently by different traditional lexica, and stroke counting can also be difficult. Here is what the Unicode Standard says:

      To expedite locating specific Han ideographic characters in the code charts, radical-stroke indices are provided on the Unicode web site. […]
      The most influential authority for radical-stroke information is the eighteenth-century KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radicals today is that many simplified characters are difficult to classify under any of the 214 KangXi radicals. As a result, various modern radical sets have been introduced. None, however, is in general use, and the 214 KangXi radicals remain the best known. […]
      The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Standard follows a number of different sources for radical-stroke classification. Where two sources are at odds as to radical or stroke count for a given character, the character is shown in both positions in the radical-stroke charts.

      Note that even if we assume the radical/stroke index to be unambiguous and correct, it wouldn’t suffice as a source of information to transform a character into a sequence of components, because the only component of the character fully described by this is the radical.
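A small sketch of how a radical/stroke index orders characters, assuming each character has exactly one (radical number, residual stroke count) pair. The `RADICAL_STROKE` table is entered by hand for illustration; in practice such data would come from a source like the Unihan database, and a character could have more than one entry. The function names are hypothetical.

```python
# (KangXi radical number, strokes in the rest of the character), entered
# by hand for a few characters. Radicals: 30 = 口, 38 = 女, 72 = 日, 75 = 木.
RADICAL_STROKE = {
    "明": (72, 4),
    "好": (38, 3),
    "林": (75, 4),
    "吗": (30, 3),
    "妈": (38, 3),
}


def radical_stroke_index(chars):
    """Sort by radical, then residual stroke count, then code point as a tie-break."""
    return sorted(chars, key=lambda ch: RADICAL_STROKE[ch] + (ord(ch),))


def same_radical(char_a, char_b):
    """True if both characters are filed under the same radical."""
    return RADICAL_STROKE[char_a][0] == RADICAL_STROKE[char_b][0]


if __name__ == "__main__":
    print(radical_stroke_index(list(RADICAL_STROKE)))
    print(same_radical("好", "妈"), same_radical("妈", "吗"))
```

Note that 妈 and 吗 share the component 马 but land in different groups, because the index records only the radical and a stroke count; this is why it cannot serve as a full component decomposition.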

    2. Ideographic description sequences (section 12.2): Unicode defines code points for the basic components of characters (most of them can themselves be used as standalone characters anyway), and there are code points used to glue those together to form a sequence of components that describes the composition of a more complex character. So this works in a way similar to combining characters, but there are important differences:

      1. The order of components is not uniquely defined
      2. There is no definition of a rendering mechanism for such sequences
      3. There is no mapping from ordinary characters to corresponding ideographic description sequences (although the Standard mentions that such mappings, to some extent, exist in the sources they used to compile the Han character set).

      The Standard suggests that ideographic description sequences be used to describe complex or rare characters that are not represented by any existing code point; but it explicitly discourages the use of description sequences in place of ordinary characters:

      In particular, Ideographic Description Sequences should not be used to provide alternative
      graphic representations of encoded ideographs in data interchange. Searching, collation,
      and other content-based text operations would then fail.
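For illustration, an ideographic description sequence can be parsed into a tree or flattened into its leaf components with a few lines of code. This is a sketch under assumptions: it only recognises the description characters U+2FF0 to U+2FFB (later versions of the Standard added further operators, which are ignored here), the example sequences are written by hand because, as noted above, no character-to-sequence mapping is assumed, and the function names are hypothetical.

```python
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB   # ideographic description characters
TERNARY = {0x2FF2, 0x2FF3}             # operators that take three operands


def is_idc(ch):
    """True if ch is one of the description (operator) characters."""
    return IDC_FIRST <= ord(ch) <= IDC_LAST


def parse_ids(ids):
    """Parse a prefix-notation sequence into nested tuples (operator, operands...)."""
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("sequence ends before all operands were read")
        ch = ids[pos]
        if not is_idc(ch):
            return ch, pos + 1          # a leaf component
        arity = 3 if ord(ch) in TERNARY else 2
        operands = []
        pos += 1
        for _ in range(arity):
            operand, pos = parse(pos)
            operands.append(operand)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete sequence")
    return tree


def leaf_components(ids):
    """Flatten a sequence into its components, dropping the operators."""
    return [ch for ch in ids if not is_idc(ch)]


if __name__ == "__main__":
    print(parse_ids("⿰女子"))             # a hand-written description of 好
    print(leaf_components("⿱艹⿰日月"))     # a hand-written description of 萌
```

The flattened component list could be fed to an edit distance function, but because the same character can be described by more than one sequence, two descriptions of one character may still come out as different.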
