Can anyone help with a SQL problem I’m having, whereby I need to merge n rows into one record? The individual records may or may not have fields populated that others do.
Basically, I have an issue where duplicate records have been created in SQL. Some contain information that others don’t. I need to merge them (I can rank them), updating a field if the value doesn’t exist in the former record (starting with the highest ranked first).
For instance, if I have two user records, one has the last name populated, the other has the first name. These are duplicates and need to be merged into one record, like a coalesce. However, there are n rows.
It’s essentially a transpose of many records into one, where a field is updated only if a duplicate record lower in rank has that field populated and the field doesn’t exist in the higher-ranked row.
Here is a very simplified version of the issue. As you can see in SQL Fiddle, the script creates 6 records. These records should be merged into 2 records with all the fields filled in.
The problem is that there could be any number of rows. I cannot use a COALESCE statement because the number of rows varies.
Hope that makes sense?
CREATE TABLE [dbo].[Employee](
[EmployeeId] [varchar](10) NULL,
[First Name] [varchar](30) NULL,
[Middle Name] [varchar](30) NOT NULL,
[Last Name] [varchar](30) NOT NULL,
[E-Mail] [varchar](80) NOT NULL);
insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('BOB1','Bob','','','bob@hotmail.com');
insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('BOB1','','John','','bob@hotmail.com');
insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('BOB1','','','Smith','bob@hotmail.com');
insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('MARK1','','Peter','','mark@hotmail.com');
insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('MARK1','Mark','','','mark@hotmail.com');
insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('MARK1','','','Davis','mark@hotmail.com');
select * from [Employee]
Hope that makes sense.
Thanks
If the performance is important enough to justify a couple of hours of coding and you are allowed to use SQLCLR, you can calculate all values in a single table scan with a multi-parameter user-defined aggregate.
Here’s an example of an aggregate that returns the lowest-ranked non-NULL string:
Assuming your table looks something like this:
CREATE TABLE TopNonNullRank (
Id INT NOT NULL,
UserId NVARCHAR (32) NOT NULL,
Value1 NVARCHAR (128) NULL,
Value2 NVARCHAR (128) NULL,
Value3 NVARCHAR (128) NULL,
Value4 NVARCHAR (128) NULL,
PRIMARY KEY CLUSTERED (Id ASC)
);
The following simple query returns the top non-NULL value for each column.
The only thing left is merging the results back to the original table. The simplest way would be something like this:
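As a rough sketch of that kind of update, here is a plain T-SQL version that does not use the CLR aggregate at all. It swaps in correlated `TOP (1)` subqueries to pick the top non-NULL value per column, so it reads the table more than once instead of in a single scan. It assumes that `Id` doubles as the rank (lowest `Id` wins within a `UserId`); the `Merged` CTE name is just for illustration.

```sql
-- Assumption: within each UserId, the row with the lowest Id has the highest rank.
WITH Merged AS (
    SELECT u.UserId,
           -- For each column, take the first non-NULL value in rank order.
           (SELECT TOP (1) t.Value1 FROM TopNonNullRank AS t
             WHERE t.UserId = u.UserId AND t.Value1 IS NOT NULL
             ORDER BY t.Id) AS Value1,
           (SELECT TOP (1) t.Value2 FROM TopNonNullRank AS t
             WHERE t.UserId = u.UserId AND t.Value2 IS NOT NULL
             ORDER BY t.Id) AS Value2,
           (SELECT TOP (1) t.Value3 FROM TopNonNullRank AS t
             WHERE t.UserId = u.UserId AND t.Value3 IS NOT NULL
             ORDER BY t.Id) AS Value3,
           (SELECT TOP (1) t.Value4 FROM TopNonNullRank AS t
             WHERE t.UserId = u.UserId AND t.Value4 IS NOT NULL
             ORDER BY t.Id) AS Value4
    FROM (SELECT DISTINCT UserId FROM TopNonNullRank) AS u
)
-- Write the merged values back to every row of the same UserId.
UPDATE t
SET    t.Value1 = m.Value1,
       t.Value2 = m.Value2,
       t.Value3 = m.Value3,
       t.Value4 = m.Value4
FROM   TopNonNullRank AS t
JOIN   Merged AS m ON m.UserId = t.UserId;
```

If the source data uses empty strings rather than NULL (as the `Employee` sample in the question does), wrapping the columns in `NULLIF(col, '')` first would be needed for the `IS NOT NULL` filters to behave as intended.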
Note that this update still leaves you with duplicate rows, and you would need to get rid of them.
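One possible way to remove the leftover duplicates, sketched under the same assumption that the lowest `Id` per `UserId` is the row to keep (the `Ranked` CTE and the `rn` column are illustrative names):

```sql
-- Number the rows of each UserId in rank order, then delete all but the first.
WITH Ranked AS (
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY Id) AS rn
    FROM TopNonNullRank
)
DELETE FROM Ranked
WHERE rn > 1;
```

Running this only after the update has been applied matters, since the deleted rows are the ones that supplied the missing values.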
You could also get more fancy and store the results of this query into a temporary table, and then use a MERGE statement to apply them to the original table.
Another option would be to store the results in a new table, and then swap it with the original table using the sp_rename stored proc.
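A minimal sketch of the temporary-table plus MERGE variant, again in plain T-SQL with correlated subqueries standing in for the CLR aggregate. It assumes the lowest `Id` per `UserId` is the surviving row; `#Merged`, `tgt`, `src` and `k` are names made up for this example. Note that the `NOT MATCHED BY SOURCE` branch deletes every row whose `Id` is not in `#Merged`, so this version updates and de-duplicates in one statement.

```sql
-- Build one merged row per UserId, keyed by the Id of the top-ranked row.
SELECT k.Id,
       k.UserId,
       (SELECT TOP (1) t.Value1 FROM TopNonNullRank AS t
         WHERE t.UserId = k.UserId AND t.Value1 IS NOT NULL
         ORDER BY t.Id) AS Value1,
       (SELECT TOP (1) t.Value2 FROM TopNonNullRank AS t
         WHERE t.UserId = k.UserId AND t.Value2 IS NOT NULL
         ORDER BY t.Id) AS Value2,
       (SELECT TOP (1) t.Value3 FROM TopNonNullRank AS t
         WHERE t.UserId = k.UserId AND t.Value3 IS NOT NULL
         ORDER BY t.Id) AS Value3,
       (SELECT TOP (1) t.Value4 FROM TopNonNullRank AS t
         WHERE t.UserId = k.UserId AND t.Value4 IS NOT NULL
         ORDER BY t.Id) AS Value4
INTO   #Merged
FROM  (SELECT UserId, MIN(Id) AS Id
       FROM TopNonNullRank
       GROUP BY UserId) AS k;

-- Apply the merged rows: update the survivors, delete the duplicates.
MERGE TopNonNullRank AS tgt
USING #Merged AS src
   ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.Value1 = src.Value1,
               tgt.Value2 = src.Value2,
               tgt.Value3 = src.Value3,
               tgt.Value4 = src.Value4
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

DROP TABLE #Merged;
```

For the table-swap option, the same SELECT could write into a permanent table instead of `#Merged`, followed by two `sp_rename` calls (original to a backup name, new table to the original name); constraints and indexes on the original would have to be recreated on the new table.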