Our organization is currently building a new data warehouse. We are able to use some techniques borrowed from the DW community, such as ETL processing to conform data and de-normalized dimensions in the Kimball style. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go along.
The problem: We have multiple sources of data that often supply conflicting facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person, so even if the inbound record doesn’t exactly match, we can score based on other things like zip code radius.
Here’s the question: What is the standard way to handle multiple versions of a fact from two or more sources?
I understand one of the main ideas of the data warehouse is to keep a running history of any fact, which we are doing. That works well when a record is maintained by a single inbound source: we keep the history of that fact over time. The problem occurs when two different sources, each updating on a daily basis, report two different facts. For example, source A says the name is Mary Smith and source B says the name is Mary Jane, so the value changes every day! Based on the matching algorithm we’re confident it’s the same person, but because of our history-style table, the record keeps flipping back and forth between the two names every day, since each load is read as a “change” from the other data source.
An example table:
first_name  last_name  source  last_updated
Mary        Smith      A       5/2/12 1:00am
Mary        Jane       B       5/2/12 2:00am
Mary        Smith      A       5/3/12 1:00am
Mary        Jane       B       5/3/12 2:00am
Mary        Smith      A       5/4/12 1:00am
Mary        Jane       B       5/4/12 2:00am
...
Have one table that stores your external data, a second table that contains your cleaned data, and a mapping table between the two. Or something broadly similar.
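As a rough illustration of that three-table layout, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (source_person, master_person, person_map and their columns) are assumptions made up for this example, not a prescribed schema, and the sample rows reuse the names from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- External data, stored exactly as each source delivered it (hypothetical names).
CREATE TABLE source_person (
    source_person_id INTEGER PRIMARY KEY,
    source           TEXT NOT NULL,
    first_name       TEXT,
    last_name        TEXT,
    last_updated     TEXT NOT NULL
);
-- Cleaned data: one row per real-world person.
CREATE TABLE master_person (
    master_person_id INTEGER PRIMARY KEY,
    first_name       TEXT,
    last_name        TEXT
);
-- Mapping: which source rows the matching step attached to which master row.
CREATE TABLE person_map (
    source_person_id INTEGER PRIMARY KEY
        REFERENCES source_person (source_person_id),
    master_person_id INTEGER NOT NULL
        REFERENCES master_person (master_person_id)
);
""")

# One fact per source, both mapped to the same master record.
conn.execute("INSERT INTO source_person VALUES (1, 'A', 'Mary', 'Smith', '2012-05-02T01:00')")
conn.execute("INSERT INTO source_person VALUES (2, 'B', 'Mary', 'Jane', '2012-05-02T02:00')")
conn.execute("INSERT INTO master_person VALUES (100, 'Mary', 'Jane')")
conn.execute("INSERT INTO person_map VALUES (1, 100)")
conn.execute("INSERT INTO person_map VALUES (2, 100)")

# Walk from the master record back to every source fact hanging off it.
facts = conn.execute(
    """
    SELECT s.source, s.last_name
    FROM person_map AS m
    JOIN source_person AS s ON s.source_person_id = m.source_person_id
    WHERE m.master_person_id = ?
    ORDER BY s.source
    """,
    (100,),
).fetchall()
```

Both conflicting names stay available in source_person, while master_person holds the single entity the rest of the warehouse refers to.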
The objective is to load the facts from your source once, and keep them.
Then use your fuzzy logic to relate them to master records somewhere, which you only need to do when new facts are loaded or old facts are changed.
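One way to read "load once, and keep them" is to compare each inbound row only against the latest row from the same source, so that source A repeating "Smith" every day is not recorded as a change. The sketch below assumes a hypothetical source_key column identifying the person within each source, a hypothetical load_fact helper, and ISO-formatted timestamps; none of these come from the original setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE source_person (
        source     TEXT NOT NULL,
        source_key TEXT NOT NULL,  -- hypothetical per-source person identifier
        first_name TEXT,
        last_name  TEXT,
        loaded_at  TEXT NOT NULL   -- ISO timestamps sort correctly as text
    )
    """
)

def load_fact(conn, source, source_key, first_name, last_name, loaded_at):
    """Store the row only if it differs from this source's own latest row.

    Returns True when a new fact was stored, which is also the signal
    that the matching step needs to run again for this record.
    """
    latest = conn.execute(
        "SELECT first_name, last_name FROM source_person "
        "WHERE source = ? AND source_key = ? "
        "ORDER BY loaded_at DESC LIMIT 1",
        (source, source_key),
    ).fetchone()
    if latest == (first_name, last_name):
        return False  # same value as last time from this source: not a change
    conn.execute(
        "INSERT INTO source_person VALUES (?, ?, ?, ?, ?)",
        (source, source_key, first_name, last_name, loaded_at),
    )
    return True

# Two days of feeds from the question's example.
feed = [
    ("A", "a-1", "Mary", "Smith", "2012-05-02T01:00"),
    ("B", "b-7", "Mary", "Jane", "2012-05-02T02:00"),
    ("A", "a-1", "Mary", "Smith", "2012-05-03T01:00"),
    ("B", "b-7", "Mary", "Jane", "2012-05-03T02:00"),
]
stored = [load_fact(conn, *row) for row in feed]
row_count = conn.execute("SELECT COUNT(*) FROM source_person").fetchone()[0]
```

With this comparison, the second day's loads store nothing, so the history no longer alternates between the two names.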
Still, you have the choice of which last_name to use. But that can be almost arbitrary in the absence of determining data. For example: pick the last name from the fact loaded most recently.
You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.
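The "most recently loaded fact" rule could look something like the sketch below. The table layout, the loaded_at column, and the current_last_name helper are all hypothetical names chosen for the example; any other tie-breaking rule (such as preferring a trusted source) would slot into the same query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_person (
    source_person_id INTEGER PRIMARY KEY,
    source           TEXT NOT NULL,
    first_name       TEXT,
    last_name        TEXT,
    loaded_at        TEXT NOT NULL
);
CREATE TABLE person_map (
    source_person_id INTEGER PRIMARY KEY,
    master_person_id INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO source_person VALUES (1, 'A', 'Mary', 'Smith', '2012-05-02T01:00')")
conn.execute("INSERT INTO source_person VALUES (2, 'B', 'Mary', 'Jane', '2012-05-02T02:00')")
conn.execute("INSERT INTO person_map VALUES (1, 100)")
conn.execute("INSERT INTO person_map VALUES (2, 100)")

def current_last_name(conn, master_person_id):
    """Return the last name from the most recently loaded mapped fact, or None."""
    row = conn.execute(
        """
        SELECT s.last_name
        FROM person_map AS m
        JOIN source_person AS s ON s.source_person_id = m.source_person_id
        WHERE m.master_person_id = ?
        ORDER BY s.loaded_at DESC
        LIMIT 1
        """,
        (master_person_id,),
    ).fetchone()
    return row[0] if row else None

chosen = current_last_name(conn, 100)
```

Because each source fact is stored only once, the chosen name stays stable until one of the sources actually reports a different value.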