I have a table with the following structure:
WorkerPersons
-------------------------------
ID (PK)
PersonID (Indicates which version of Person the record describes)
SomeColumn1 (data specific to Worker)
SomeColumn2 (data specific to Person)
....
SomeColumnN
-------------------------------
As you can see, it’s a denormalized table, which holds both Worker and Person (and many versions of one Person) data in one table. My wish is to normalize that table, however, as the table holds a lot of data (many many columns), I need to be sure which columns should go to Workers table and which columns to Persons table. The outcome should be like this:
Workers Persons
----------------------- ---------------------
ID ID
PersonID (now a FK) PersonColumn1
WorkerColumn1 PersonColumn2
WorkerColumn2 ...
... PersonColumnN
WorkerColumnN
----------------------- ---------------------
To do that, I need to analyze which data differs in scope of Person over all unique Persons (wich are separated by PersonID in WorkerPersons). For example:
WorkerPersons
-------------------------------------------------------
ID PersonID Column1 Column2 Column3
-------------------------------------------------------
1 PersonA 10.1 John Doe Single
2 PersonA 10.1 John Doe Single
3 PersonA 10.1 John Doe Married
4 PersonB 09.2 Sully Single
5 PersonB 09.2 Sullivan Single
In this case, there are 3 versions on PersonA and 2 versions of PersonB. Column1 values are always the same over all versions of Person, and we can move that column to table Worker. But Column 2 and Column3 values change over different versions of Person, so those values should be moved to Person table.
No imagine, I have about 10 tables like this that need to be normalized, with about 40 columns in each. Eeach table holds about 500k to 5m rows.
I need a script that helps me analyse which columns to move where. I need a script that outputs all columns that change in scope of unique Person over the whole table. I’ve no ideas however how to do that. I experimented with LAG analytical function to compare against the next row but how in the world to output changed columns is beyond me.
Please advise.
Best wishes,
Andrew
Since 10 tables is not a lot, here is (some sort of) pseudo code
with information schema and dynamic queries you could make the above proper PL/SQL or take the core query and script it in your favourite language.
EDIT:
The above assumes no
NULLs incolumn_name.EDIT2:
Other variants of the core query can be
Which will return a row if all values per PersonID are the same.
(this query also has problems with NULLS, but I consider NULLs a separate issue; you can always cast a NULL to some out-of-domain value for purposes of the above query)
The above query can be also written as
Which will test multiple columns at the same time returning true for columns that belong to ‘Workers’ and false for columns that should go into ‘Persons’.