Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 48935
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 10, 20262026-05-10T16:21:29+00:00 2026-05-10T16:21:29+00:00

I have several sources of tables with personal data, like this: SOURCE 1 ID,

  • 0

I have several sources of tables with personal data, like this:

SOURCE 1 ID, FIRST_NAME, LAST_NAME, FIELD1, ... 1, jhon, gates ...  SOURCE 2 ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ... 1, jon, gate ...  SOURCE 3 ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ... 2, jhon, ballmer ... 

So, assuming that records with ID 1, from sources 1 and 2, are the same person, my problem is how to determine if a record in every source, represents the same person. Additionally, sure not every records exists in all sources. All the names, are written in spanish, mainly.

In this case, the exact matching needs to be relaxed because we assume the data sources has not been rigurously checked against the official bureau of identification of the country. Also we need to assume typos are common, because the nature of the processes to collect the data. What is more, the amount of records is around 2 or 3 millions in every source…

Our team had thought in something like this: first, force exact matching in selected fields like ID NUMBER, and NAMES to know how hard the problem can be. Second, relaxing the matching criteria, and count how much records more can be matched, but is here where the problem arises: how to do to relax the matching criteria without generating too noise neither restricting too much?

What tool can be more effective to handle this?, for example, do you know about some especific extension in some database engine to support this matching? Do you know about clever algorithms like soundex to handle this approximate matching, but for spanish texts?

Any help would be appreciated!

Thanks.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. 2026-05-10T16:21:29+00:00Added an answer on May 10, 2026 at 4:21 pm

    The crux of the problem is to compute one or more measures of distance between each pair of entries and then consider them to be the same when one of the distances is less than a certain acceptable threshold. The key is to setup the analysis and then vary the acceptable distance until you reach what you consider to be the best trade-off between false-positives and false-negatives.

    One distance measurement could be phonetic. Another you might consider is the Levenshtein or edit distance between the entires, which would attempt to measure typos.

    If you have a reasonable idea of how many persons you should have, then your goal is to find the sweet spot where you are getting about the right number of persons. Make your matching too fuzzy and you’ll have too few. Make it to restrictive and you’ll have too many.

    If you know roughly how many entries a person should have, then you can use that as the metric to see when you are getting close. Or you can divide the number of records into the average number of records for each person and get a rough number of persons that you’re shooting for.

    If you don’t have any numbers to use, then you’re left picking out groups of records from your analysis and checking by hand whether they look like the same person or not. So it’s guess and check.

    I hope that helps.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a database with several hundred tables imported from a third source. Using
I have some data SELECT [field names] FROM [several joined tables] WHERE [some criteria
I have several xml files that are formated this way: <ROOT> <OBJECT> <identity> <id>123</id>
I have several UITableViews and what I would like to do is to be
I have several listviews with two collumns like. I am building a stats application
I have a class MyCLController with a property dataSource that is data source delegate
We're designing data import from an external source like MAS200 into our production SQL
I'm building an iPhone app that aggregates data from several different data sources and
I have a sql script that creates several tables, a trigger, and a trigger
I have a table called 1 Main Contacts (Main), and several other tables. They

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.