The Archive Base

Editorial Team
Asked: June 4, 2026

I am using the Levenshtein distance to find similar strings after OCR. However, for some strings the edit distance is the same, although the visual appearance is obviously different.

For example, the string Co returns these matches (edit distance in parentheses):

CY (1)
CZ (1)
Ca (1)

Considering that Co is the result from an OCR engine, Ca would be a more likely match than the other ones. Therefore, after calculating the Levenshtein distance, I’d like to refine the query result by ordering it by visual similarity. To calculate this similarity, I’d like to use a standard sans-serif font, like Arial.

Is there a library I can use for this purpose, or how could I implement this myself? Alternatively, are there any string similarity algorithms that are more accurate than the Levenshtein distance, which I could use in addition?

1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 4:37 am

    If you’re looking for a table that lets you calculate a ‘replacement cost’ of sorts based on visual similarity, I’ve been searching for such a thing for a while with little success, so I started treating it as a new problem. I’m not working with OCR, but I am looking for a way to limit the search parameters in a probabilistic search for mistyped characters. Since they are mistyped because a human has confused the characters visually, the same principle should apply to you.

    My approach was to categorize letters by their stroke components in an 8-bit field. The bits, numbered from the leftmost (most significant) bit, are:

    7: Left Vertical
    6: Center Vertical
    5: Right Vertical
    4: Top Horizontal
    3: Middle Horizontal
    2: Bottom Horizontal
    1: Top-left to bottom-right stroke
    0: Bottom-left to top-right stroke
    

    For lower-case characters, descenders on the left are recorded in bit 1, and descenders on the right in bit 0, as diagonals.
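
The bit layout above can be written down as named flags. This is only a minimal sketch in Python: the class name `Stroke` and its member names are illustrative and not part of the original scheme. Only the bit positions come from the list above.

```python
from enum import IntFlag


class Stroke(IntFlag):
    """Stroke components, one bit each (bit 7 is the most significant)."""
    LEFT_VERTICAL = 1 << 7
    CENTER_VERTICAL = 1 << 6
    RIGHT_VERTICAL = 1 << 5
    TOP_HORIZONTAL = 1 << 4
    MIDDLE_HORIZONTAL = 1 << 3
    BOTTOM_HORIZONTAL = 1 << 2
    DIAGONAL_DOWN = 1 << 1  # top-left to bottom-right, or a left descender
    DIAGONAL_UP = 1 << 0    # bottom-left to top-right, or a right descender


# 'E': a left vertical plus top, middle and bottom horizontals.
E = (Stroke.LEFT_VERTICAL | Stroke.TOP_HORIZONTAL
     | Stroke.MIDDLE_HORIZONTAL | Stroke.BOTTOM_HORIZONTAL)

# 'T': a center vertical and a top horizontal.
T = Stroke.CENTER_VERTICAL | Stroke.TOP_HORIZONTAL

print(f"E = {int(E):08b} = {int(E):02X}")  # E = 10011100 = 9C
print(f"T = {int(T):08b} = {int(T):02X}")  # T = 01010000 = 50
```

Composing a mask from named strokes makes it easier to check a value by eye than a bare hex number does.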

    With that scheme, I came up with the following values which attempt to rank the characters according to visual similarity.

    m:               11110000: F0
    g:               10111101: BD
    S,B,G,a,e,s:     10111100: BC
    R,p:             10111010: BA
    q:               10111001: B9
    P:               10111000: B8
    Q:               10110110: B6
    D,O,o:           10110100: B4
    n:               10110000: B0
    b,h,d:           10101100: AC
    H:               10101000: A8
    U,u:             10100100: A4
    M,W,w:           10100011: A3
    N:               10100010: A2
    E:               10011100: 9C
    F,f:             10011000: 98
    C,c:             10010100: 94
    r:               10010000: 90
    L:               10000100: 84
    K,k:             10000011: 83
    T:               01010000: 50
    t:               01001000: 48
    J,j:             01000100: 44
    Y:               01000011: 43
    I,l,i:           01000000: 40
    Z,z:             00010101: 15
    A:               00001011: 0B
    y:               00000101: 05
    V,v,X,x:         00000011: 03
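
One possible way to turn these values into a similarity measure is to count the stroke bits that differ between two characters (the Hamming distance of their masks). That choice is an assumption of this sketch. The table itself only gives the values, and the names `STROKES` and `visual_distance` are made up for illustration.

```python
# Stroke masks copied from the table above; each key lists the
# characters that share one value.
_GROUPS = {
    "m": 0xF0, "g": 0xBD, "SBGaes": 0xBC, "Rp": 0xBA, "q": 0xB9,
    "P": 0xB8, "Q": 0xB6, "DOo": 0xB4, "n": 0xB0, "bhd": 0xAC,
    "H": 0xA8, "Uu": 0xA4, "MWw": 0xA3, "N": 0xA2, "E": 0x9C,
    "Ff": 0x98, "Cc": 0x94, "r": 0x90, "L": 0x84, "Kk": 0x83,
    "T": 0x50, "t": 0x48, "Jj": 0x44, "Y": 0x43, "Ili": 0x40,
    "Zz": 0x15, "A": 0x0B, "y": 0x05, "VvXx": 0x03,
}

# Expand the groups into one mask per character.
STROKES = {ch: mask for chars, mask in _GROUPS.items() for ch in chars}


def visual_distance(a, b):
    """Number of stroke bits that differ between two letters (0 to 8)."""
    return bin(STROKES[a] ^ STROKES[b]).count("1")


# The example from the question: which candidate is closest to "Co"?
for candidate in ("CY", "CZ", "Ca"):
    print(candidate, visual_distance("o", candidate[1]))
# CY 7
# CZ 3
# Ca 1
```

With this measure, Ca comes out as the closest match to Co, which agrees with the intuition in the question. Characters that share a table row (for example S and B) get a distance of 0, so the table alone cannot tell them apart.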
    

    As it stands, this is too primitive for my purposes and needs more work. You may be able to use it, however, or adapt it to suit your purposes. The scheme is fairly simple. The ranking was made for a monospace font; if you are using a sans-serif font such as Arial, you will likely have to rework the values.
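
As one way of adapting the table, the values could feed a weighted Levenshtein distance in which substituting a visually similar letter is cheaper than substituting a dissimilar one. The sketch below is an assumption-laden illustration, not a tested method: the cost formula `(1 + differing bits) / 9`, the fallback cost of 1 for characters missing from the table, and all function names are made up here.

```python
# Stroke masks copied from the table above.
_GROUPS = {
    "m": 0xF0, "g": 0xBD, "SBGaes": 0xBC, "Rp": 0xBA, "q": 0xB9,
    "P": 0xB8, "Q": 0xB6, "DOo": 0xB4, "n": 0xB0, "bhd": 0xAC,
    "H": 0xA8, "Uu": 0xA4, "MWw": 0xA3, "N": 0xA2, "E": 0x9C,
    "Ff": 0x98, "Cc": 0x94, "r": 0x90, "L": 0x84, "Kk": 0x83,
    "T": 0x50, "t": 0x48, "Jj": 0x44, "Y": 0x43, "Ili": 0x40,
    "Zz": 0x15, "A": 0x0B, "y": 0x05, "VvXx": 0x03,
}
STROKES = {ch: mask for chars, mask in _GROUPS.items() for ch in chars}


def substitution_cost(a, b):
    """Cost in [0, 1]; hypothetical scaling of the differing stroke bits."""
    if a == b:
        return 0.0
    if a not in STROKES or b not in STROKES:
        return 1.0  # no visual information: plain Levenshtein cost
    bits = bin(STROKES[a] ^ STROKES[b]).count("1")
    return (1 + bits) / 9  # never 0 for two different characters


def weighted_levenshtein(s, t):
    """Levenshtein distance with visually weighted substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                           # deletion
                curr[j - 1] + 1.0,                       # insertion
                prev[j - 1] + substitution_cost(a, b),   # substitution
            ))
        prev = curr
    return prev[-1]


candidates = ["CY", "CZ", "Ca"]
ranked = sorted(candidates, key=lambda c: weighted_levenshtein("Co", c))
print(ranked)  # ['Ca', 'CZ', 'CY']
```

Insertions and deletions keep a cost of 1, so the result never exceeds the plain Levenshtein distance and can be used to reorder candidates that would otherwise tie.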

    This is a hybrid table that includes all characters, lower- and upper-case. Splitting it into an upper-case-only table and a lower-case-only table might prove more effective, and would also allow you to apply specific casing penalties.
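
A rough sketch of what such a split could look like follows. It uses only a handful of characters, keeps the hybrid table's values unchanged, and adds a made-up `CASE_PENALTY`. In practice each table would presumably be tuned separately, and the penalty value is purely hypothetical.

```python
# Subset of the hybrid table, split by case. The values are unchanged
# here; separate tuning per case is left open.
UPPER = {"C": 0x94, "O": 0xB4, "D": 0xB4, "Q": 0xB6, "U": 0xA4}
LOWER = {"c": 0x94, "o": 0xB4, "a": 0xBC, "e": 0xBC, "u": 0xA4}

CASE_PENALTY = 2  # hypothetical extra cost, in bits, for a case change


def mask(ch):
    """Look the character up in the table for its own case."""
    return UPPER[ch] if ch.isupper() else LOWER[ch]


def visual_distance(a, b):
    """Differing stroke bits, plus a penalty when the case differs."""
    bits = bin(mask(a) ^ mask(b)).count("1")
    if a.isupper() != b.isupper():
        bits += CASE_PENALTY
    return bits


print(visual_distance("o", "a"))  # 1: same case, one stroke differs
print(visual_distance("o", "O"))  # 2: same strokes, case penalty only
print(visual_distance("o", "Q"))  # 3: one stroke differs plus the penalty
```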

    Keep in mind that this is early experimentation. If you see a way to improve it (for example by changing the bit-sequencing) by all means feel free to do so.

© 2021 The Archive Base. All Rights Reserved