Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 776561
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 14, 20262026-05-14T19:28:09+00:00 2026-05-14T19:28:09+00:00

I’m a python n00b and I’d like some suggestions on how to improve the

  • 0

I’m a python n00b and I’d like some suggestions on how to improve the algorithm to improve the performance of this method to compute the Jaro-Winkler distance of two names.

def winklerCompareP(str1, str2):
"""Return approximate string comparator measure (between 0.0 and 1.0)

USAGE:
  score = winkler(str1, str2)

ARGUMENTS:
  str1  The first string
  str2  The second string

DESCRIPTION:
  As described in 'An Application of the Fellegi-Sunter Model of
  Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler
  and Yves Thibaudeau.

  Based on the 'jaro' string comparator, but modifies it according to whether
  the first few characters are the same or not.
"""

# Quick check if the strings are the same - - - - - - - - - - - - - - - - - -
#
jaro_winkler_marker_char = chr(1)
if (str1 == str2):
    return 1.0

len1 = len(str1)
len2 = len(str2)
halflen = max(len1,len2) / 2 - 1

ass1  = ''  # Characters assigned in str1
ass2  = '' # Characters assigned in str2
#ass1 = ''
#ass2 = ''
workstr1 = str1
workstr2 = str2

common1 = 0    # Number of common characters
common2 = 0

#print "'len1', str1[i], start, end, index, ass1, workstr2, common1"
# Analyse the first string    - - - - - - - - - - - - - - - - - - - - - - - - -
#
for i in range(len1):
    start = max(0,i-halflen)
    end   = min(i+halflen+1,len2)
    index = workstr2.find(str1[i],start,end)
    #print 'len1', str1[i], start, end, index, ass1, workstr2, common1
    if (index > -1):    # Found common character
        common1 += 1
        #ass1 += str1[i]
        ass1 = ass1 + str1[i]
        workstr2 = workstr2[:index]+jaro_winkler_marker_char+workstr2[index+1:]
#print "str1 analyse result", ass1, common1

#print "str1 analyse result", ass1, common1
# Analyse the second string - - - - - - - - - - - - - - - - - - - - - - - - -
#
for i in range(len2):
    start = max(0,i-halflen)
    end   = min(i+halflen+1,len1)
    index = workstr1.find(str2[i],start,end)
    #print 'len2', str2[i], start, end, index, ass1, workstr1, common2
    if (index > -1):    # Found common character
        common2 += 1
        #ass2 += str2[i]
        ass2 = ass2 + str2[i]
        workstr1 = workstr1[:index]+jaro_winkler_marker_char+workstr1[index+1:]

if (common1 != common2):
    print('Winkler: Wrong common values for strings "%s" and "%s"' % \
                (str1, str2) + ', common1: %i, common2: %i' % (common1, common2) + \
                ', common should be the same.')
    common1 = float(common1+common2) / 2.0    ##### This is just a fix #####

if (common1 == 0):
    return 0.0

# Compute number of transpositions    - - - - - - - - - - - - - - - - - - - - -
#
transposition = 0
for i in range(len(ass1)):
    if (ass1[i] != ass2[i]):
        transposition += 1
transposition = transposition / 2.0

# Now compute how many characters are common at beginning - - - - - - - - - -
#
minlen = min(len1,len2)
for same in range(minlen+1):
    if (str1[:same] != str2[:same]):
        break
same -= 1
if (same > 4):
    same = 4

common1 = float(common1)
w = 1./3.*(common1 / float(len1) + common1 / float(len2) + (common1-transposition) / common1)

wn = w + same*0.1 * (1.0 - w)
return wn

Example output

ZIMMERMANN  ARMIENTO    0.814583333
ZIMMERMANN  ZIMMERMANN  1
ZIMMERMANN  CANNONS         0.766666667
CANNONS AKKER           0.8
CANNONS ALDERSON    0.845833333
CANNONS ALLANBY         0.833333333
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-14T19:28:10+00:00Added an answer on May 14, 2026 at 7:28 pm

    I focused more on optimizing to get more out of Python than on optimizing the algorithm because I don’t think that there is much of an algorithmic improvement to be had here. Here are some Python optimizations that I came up with.

    (1). Since you appear to be using Python 2.x, change all range()’s to xrange()’s. range() generates the full list of numbers before iterating over them while xrange generates them as needed.

    (2). Make the following substitutions for max and min:

    start = max(0,i-halflen)
    

    with

    start = i - halflen if i > halflen else 0
    

    and

    end = min(i+halflen+1,len2)
    

    with

    end = i+halflen+1 if i+halflen+1 < len2 else len2
    

    in the first loop and similar ones for the second loop. There’s also another min() farther down and a max() near the beginning of the function so do the same with those. Replacing the min()’s and max()’s really helped to reduce the time. These are convenient functions, but more costly than the method I’ve replaced them with.

    (3). Use common1 instead of len(ass1). You’ve kept track of the length of ass1 in common1 so let’s use it rather than calling a costly function to find it again.

    (4). Replace the following code:

    minlen = min(len1,len2)
    for same in xrange(minlen+1):
        if (str1[:same] != str2[:same]):
            break
    same -= 1
    

    with

    for same in xrange(minlen):
        if str1[same] != str2[same]:
            break
    

    The reason for this is mainly that str1[:same] creates a new string every time through the loop and you will be checking parts that you’ve already checked. Also, there’s no need to check if '' != '' and decrement same afterwards if we don’t have to.

    (5). Use psyco, a just-in-time compiler of sorts. Once you’ve downloaded it and installed it, just add the lines

    import psyco
    psyco.full()
    

    at the top of the file to use it. Don’t use psyco unless you do the other changes that I’ve mentioned. For some reason, when I ran it on your original code it actually slowed it down.

    Using timeit, I found that I was getting a decrease in time of about 20% or so with the first 4 changes. However, when I add psyco along with those changes, the code is about 3x to 4x faster than the original.

    If you want more speed

    A fair amount of the remaining time is in the string’s find() method. I decided to try replacing this with my own. For the first loop, I replaced

    index = workstr2.find(str1[i],start,end)
    

    with

    index = -1
    for j in xrange(start,end):
        if workstr2[j] == str1[i]:
            index = j
            break
    

    and a similar form for the second loop. Without psyco, this slows down the code, but with psyco, it speeds it up quite a lot. With this final change the code is about 8x to 9x faster than the original.

    If that isn’t fast enough

    Then you should probably turn to making a C module.

    Good luck!

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.