The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 147469

The Archive Base Latest Questions

Editorial Team
Asked: May 11, 2026

I am maintaining a data warehouse with multiple sources of data about a class

I am maintaining a data warehouse with multiple sources of data about a class of entities that have to be merged. Each source has a natural key, and what is supposed to happen is that one and only one surrogate key is created for each natural key for all time. If one record from one source system with a particular natural key represents the same entity as another record from another source system with a different natural key, the same surrogate key will be assigned to both.

In other words, if source system A has natural key ABC representing the same entity as source system B’s natural key DEF, we would assign the same surrogate key to both. The table would look like this:

SURROGATE_KEY   SOURCE_A_NATURAL_KEY    SOURCE_B_NATURAL_KEY
1               ABC                     DEF

That was the plan. However, this system has been in production for a while, and the surrogate key assignment is a mess. Source system A would give natural key ABC on one day, before source system B knew about it. The DW assigned surrogate key 1 to it. Then source system B started giving natural key DEF, which represents the same thing as source system A’s natural key ABC. The DW incorrectly gave this combo surrogate key 2. The table would look like this:

SURROGATE_KEY   SOURCE_A_NATURAL_KEY    SOURCE_B_NATURAL_KEY
1               ABC                     NULL
2               ABC                     DEF
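A first cleanup step is to find every natural key that has ended up with more than one surrogate key. Below is a minimal Python sketch of that check. It assumes the mapping table has been loaded as `(surrogate_key, source_a_key, source_b_key)` tuples with `None` standing in for NULL; the function name `find_conflicts` and the "A"/"B" source labels are made up for illustration.

```python
from collections import defaultdict

def find_conflicts(rows):
    """Return {(source, natural_key): [surrogate keys]} for every natural key
    that is attached to more than one surrogate key."""
    seen = defaultdict(set)
    for surrogate, a_key, b_key in rows:
        if a_key is not None:
            seen[("A", a_key)].add(surrogate)
        if b_key is not None:
            seen[("B", b_key)].add(surrogate)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}

# The two rows from the broken table above (hypothetical in-memory form).
rows = [(1, "ABC", None), (2, "ABC", "DEF")]
print(find_conflicts(rows))  # {('A', 'ABC'): [1, 2]}
```

This only reports the conflicts; deciding which surrogate key should survive is the matching problem described next.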

So the warehouse is a mess. There are much more complex situations than this. I have a short timeline for a cleanup that requires figuring out a clean set of surrogate key to natural key mappings.

A little Googling reveals that this can be modeled as a matching problem in a non-bipartite graph:

Wikipedia – Matching

MIT 18.433 Combinatorial Optimization – Lecture Notes on Non-Bipartite Matching

I need an easy-to-understand implementation (it does not have to perform optimally) of Edmonds' paths, trees, and flowers algorithm. I don't have a formal math or CS background, and what I do have is self-taught, and I'm not in a math-y headspace tonight. Can anyone help? A well-written explanation that guides me to an implementation would be deeply appreciated.

EDIT:

A math approach is optimal because we want to maximize global fitness. A greedy approach (first take all instances of A, then B, then C…) paints you into a corner at a local maximum.
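To make the local-maximum point concrete, here is a small Python sketch comparing a greedy pass with an exhaustive search on a toy graph. The keys `A1`, `A2`, `B1`, `B2` and the candidate edges are invented for illustration, and the sketch counts matched pairs rather than scoring a weighted fitness, so treat it as a simplified stand-in for the real problem.

```python
from itertools import combinations

def is_matching(edges):
    """True if no key appears in more than one edge."""
    used = set()
    for a, b in edges:
        if a in used or b in used:
            return False
        used.update((a, b))
    return True

def greedy_matching(edges):
    """Take edges in the given order, keeping each one that still fits."""
    chosen = []
    for edge in edges:
        if is_matching(chosen + [edge]):
            chosen.append(edge)
    return chosen

def best_matching(edges):
    """Brute force: try the largest subsets first. Only viable for tiny inputs."""
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            if is_matching(subset):
                return list(subset)
    return []

# Hypothetical candidate pairings forming the path A1 - B1 - A2 - B2.
edges = [("A2", "B1"), ("A1", "B1"), ("A2", "B2")]
print(greedy_matching(edges))  # [('A2', 'B1')]
print(best_matching(edges))    # [('A1', 'B1'), ('A2', 'B2')]
```

Greedy commits to the middle edge first and blocks both remaining pairs, while the global search pairs all four keys.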

In any case, I got this pushed back to the business analysts to do manually (all 20 million of them). I'm helping them with functions to assess global match quality. This is ideal since they're the ones signing off anyway, so my backside is covered.

Not using surrogate keys doesn’t change the matching problem. There’s still a 1:1 natural key mapping that has to be discovered and maintained. The surrogate key is a convenient anchor for that, and nothing more.

1 Answer

  1. Added an answer on May 11, 2026 at 8:47 am

    I get the impression you’re going about this the wrong way; as cdonner says, there are other ways to just rebuild the key structure without going through this mess. In particular, you need to guarantee that natural keys are always unique for a given record (violating this condition is what got you into this mess!). Having both ABC and DEF identify the same record is disastrous, but ultimately repairable. I’m not even sure why you need surrogate keys at all; while they do have many advantages, I’d give some consideration to going pure-relational and just gutting them from your schema, a la Celko; it might just get you out of this mess. But that’s a decision that would have to be made after looking at your whole schema.

    To address your potential solution, I’ve pulled out my copy of D. B. West’s Introduction to Graph Theory, second edition, which describes the blossom algorithm on page 144. You’ll need some mathematical background, with both mathematical notation and graph theory, to follow the algorithm, but it’s sufficiently concise that I think it can help (if you decide to go this route). If you need explanation, first consult a resource on graph theory (Wikipedia, your local library, Google, wherever), or ask if you’re not finding what you need.

    3.3.17. Algorithm. (Edmonds’ Blossom Algorithm [1965a]—sketch).

    Input. A graph G, a matching M in G, an M-unsaturated vertex u.

    Idea. Explore M-alternating paths from u, recording for each vertex the vertex from which it was reached, and contracting blossoms when found. Maintain sets S and T analogous to those in Algorithm 3.2.1, with S consisting of u and the vertices reached along saturated edges. Reaching an unsaturated vertex yields an augmentation.

    Initialization. S = {u} and T = {} (empty set).

    Iteration. If S has no unmarked vertex, stop; there is no M-augmenting path from u. Otherwise, select an unmarked v in S. To explore from v, successively consider each y in N(v) such that y is not in T.

    If y is unsaturated by M, then trace back from y (expanding blossoms as needed) to report an M-augmenting (u, y)-path.

    If y is in S, then a blossom has been found. Suspend the exploration of v and contract the blossom, replacing its vertices in S and T by a single new vertex in S. Continue the search from this vertex in the smaller graph.

    Otherwise, y is matched to some w by M. Include y in T (reached from v), and include w in S (reached from y).

    After exploring all such neighbors of v, mark v and iterate.

    The algorithm as described here runs in time O(n^4), where n is the number of vertices. West gives references to versions that run as fast as O(n^(5/2)) or O(n^(1/2) m) (m being the number of edges). If you want these references, or citations to Edmonds' original paper, just ask and I'll dig them out of the index (which kind of sucks in this book).
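For readers who would rather start from code, here is a compact Python sketch of a blossom-based maximum matching. It is not a transcription of West's Algorithm 3.3.17: it follows a common breadth-first variant that never builds the contracted graph and instead records, in a `base` array, which blossom each vertex currently belongs to. Vertices are assumed to be numbered 0..n-1 (in the warehouse setting each natural key would first be mapped to such a number), and every name in the code is illustrative.

```python
from collections import deque

def max_matching(n, edges):
    """Return match[], where match[v] is v's partner or -1 if v is unmatched."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    match = [-1] * n
    parent = [-1] * n       # vertex from which each vertex was reached
    base = list(range(n))   # base of the blossom currently containing each vertex

    def lca(a, b):
        # Walk up from a marking blossom bases, then walk up from b to the first mark.
        seen = [False] * n
        while True:
            a = base[a]
            seen[a] = True
            if match[a] == -1:   # reached the root of the search tree
                break
            a = parent[match[a]]
        while True:
            b = base[b]
            if seen[b]:
                return b
            b = parent[match[b]]

    def mark_path(v, b, child, in_blossom):
        # Flag the blossom vertices between v and base b, rewiring parents
        # so the odd cycle can later be traversed in either direction.
        while base[v] != b:
            in_blossom[base[v]] = True
            in_blossom[base[match[v]]] = True
            parent[v] = child
            child = match[v]
            v = parent[match[v]]

    def find_path(root):
        # Search from an unmatched root; return the free endpoint of an
        # augmenting path, or -1 if none exists.
        used = [False] * n   # plays the role of the set S in the text
        for i in range(n):
            parent[i] = -1
            base[i] = i
        used[root] = True
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for to in adj[v]:
                if base[v] == base[to] or match[v] == to:
                    continue
                if to == root or (match[to] != -1 and parent[match[to]] != -1):
                    # 'to' is already in S: an odd cycle (blossom) was found.
                    cur = lca(v, to)
                    in_blossom = [False] * n
                    mark_path(v, cur, to, in_blossom)
                    mark_path(to, cur, v, in_blossom)
                    for i in range(n):
                        if in_blossom[base[i]]:
                            base[i] = cur          # "contract" into the base
                            if not used[i]:
                                used[i] = True
                                queue.append(i)
                elif parent[to] == -1:
                    parent[to] = v
                    if match[to] == -1:
                        return to                  # unsaturated: augment
                    used[match[to]] = True
                    queue.append(match[to])
        return -1

    for v in range(n):
        if match[v] == -1:
            end = find_path(v)
            while end != -1:                       # flip edges along the path
                prev = parent[end]
                nxt = match[prev]
                match[end], match[prev] = prev, end
                end = nxt
    return match

# A 5-cycle 0-1-2-3-4-0 with an extra vertex 5 hanging off vertex 2.
match = max_matching(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (2, 5)])
print(sum(1 for partner in match if partner != -1) // 2)  # 3
```

This version favors readability over speed and handles unweighted graphs only. If candidate pairings carry quality scores, a weighted matching routine would be needed instead.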
