Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 1091515
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 16, 20262026-05-16T23:32:27+00:00 2026-05-16T23:32:27+00:00

I am writing some code to find duplicate customer details in a database. I’ll

  • 0

I am writing some code to find duplicate customer details in a database. I’ll be using Levenshtein distance.

However, I am not sure how to store the relationships. I use databases all the time but have never come accross this situation and wondered if someone could point me in the right direction.

What confuses me is how to store the bidirectional nature of the relationship.

I’ve started to put some examples below, but wondered if there is a best practice for storing this type of data,

Example data

id, address

001, 5 Main Street
002, 5 Main St.
003, 5 Main Str
004, 6 High Street
005, 7 Low Street
006, 7 Low St

Suggestion 1

customer_id1, customer_id2, relationship_strength
001, 002, 0.74
001, 003, 0.77
002, 003, 0.76
005, 006, 0.77

Not happy with this approach as it sort of infers a one way relationship between customer_id1 to customer_id2. Unless of course I include all relationships both ways, but that would double the amount of processing time and the size of the tables.

eg would need to include: 002, 001, 0.74

Suggestion 2

customer_id, grouping_id
001, 1
002, 1
003, 1
005, 2
006, 2

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-16T23:32:28+00:00Added an answer on May 16, 2026 at 11:32 pm

    What we have here is a graph in which each node has a relationship (edit distance) with every other node. This is not in the normal range of data models. It is also not a permanent feature of your database (assuming you resolve the business processes which led to the duplicate data) so it isn’t worth sweating over the solution which best fits relational theory. What we need is a a practical solution.

    Think of it as a matrix. If we go for the optimum processing we won’t execute the duplicate scorings. So we score Address 1 against all the other Addresses, we score Address 2 against all the other Addresses except Address 1, we score Address 3 against all the other Addresses except Addresses 1 and 2, etc. And what we end up with is a bit like a football league table:

              addr  
              1    2     3    4     5
    addr
      1       -   95    95   80    76 
      2       -    -   100   75    72
      3       -    -     -   75    72
      4       -    -     -    -    83
      5       -    -     -    -     -
    

    This data can best be stored in suggestion 1, a table of ID1, ID2, SCORE. Although we do need to pivot the data to get the output looking like that 🙂

    In a proper league table there are two sets of scores – Home and Away – so the table is symmetrical. But that doesn’t apply here, as the edit distance for 1 > 2 is the same as 2 > 1. However, it would make querying the results more straightforward if the result set included the mirrored scores. That is, for records (1,5,76), (2,5,72), etc we generate records (5,1,76), (5,2,72). This could be done at the end of the scoring process.

              addr  
              1    2     3    4     5
    addr
      1       -   95    95   80    76 
      2      95    -   100   75    72
      3      95  100     -   75    72
      4      80   75    75    -    83
      5      76   72    72   83     -
    

    Of course, this is mainly a presentational thing, so it only needs to be done for display purposes, e.g. exporting the data to a spreadsheet. We can still get all the scores for, say, Address 5 in a readable fashion without miiroring the scores using a simple SQL statement:

    select case when id1 = 5 then id1 else id2 end as id1
           , case when id1 = 5 then id2 else id1 end as id2 
           , score
    from   your_table
    where  id1 = 5 
    or     id2 = 5
    /
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.