The problem
We have a table of duplicate customer numbers:
A varchar(16) NOT NULL,
B varchar(16) NOT NULL
These columns started off as Old and New (Delete and Retain), but devolved to where neither is preferred. The columns really are just “A” and “B” — two numbers for the same customer, in any order.
Furthermore, the table can have an arbitrary number of pairs for the same customer. You might see rows like
a,b
b,c
meaning a,b,c are all for the same customer. You might also see rows like
a,b
b,a
c,a
meaning a,b,c are all the same customer.
It’s not a clean acyclic representation like “old” and “new” values. The list of customer IDs for a customer is spread across one or more rows, and the only connection between those rows is that a value in the A or B column of one row may also appear in the A or B column of some other row. My mission is to tie them all together into one list per customer.
I want to convert this mess to something like
MasterKey int NOT NULL,
CustNum varchar(16) NOT NULL UNIQUE,
PRIMARY KEY( MasterKey, CustNum )
All the numbers for a given customer would share one MasterKey in this table. As the UNIQUE constraint says, a given CustNum can’t appear more than once.
So for example, rows like this from the original
1a,1b
1b,1c
2a,2b
2b,2c
2d,2a
...
should end up as rows like this in the new table
1 1a
1 1b
1 1c
2 2a
2 2b
2 2c
2 2d
...
Edit: The values above are just to make the pattern clear. The actual customer number values are arbitrary varchars.
My attempted solutions
This feels like a job for recursion and therefore a CTE. But the potentially cyclic nature of the source data makes it hard for me to get the anchor case. I’ve tried to pre-clean it into more of an acyclic form, but I still can’t seem to get this right.
I’m also stubbornly trying to do this as a set-based SQL operation, instead of resorting to a cursor and loop. But maybe that’s not possible.
I’ve spent a good 8 hours pondering this and trying different approaches but it keeps slipping away. Any ideas or suggestions on the correct approach, or even some example code?
I’m going to do something I haven’t done before, and post an answer to my own question. I need to give huge thanks to both Beth and JBrooks for moving me in the right direction. I really wanted to solve this in a set-based, declarative way. And maybe that’s still possible using a CTE and recursion. But once I surrendered and said it’s OK for it to be imperative and iterative, it was much easier to do it.
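As a rough illustration of why the iterative route is easier (a sketch only, not the stored procedure described in this answer), the pairs can be treated as edges of a graph: every customer number starts as its own group label, and each pass pulls a number's label down to the smallest label among its neighbours until a pass changes nothing. Cycles such as a,b / b,a are harmless because nothing is ever followed recursively. The temp table names `#DupePairs` and `#Label` and the sample rows are assumptions made for the example, and the syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical stand-in for the legacy pair table (name and sample rows are assumptions).
CREATE TABLE #DupePairs (A varchar(16) NOT NULL, B varchar(16) NOT NULL);
INSERT INTO #DupePairs (A, B) VALUES
    ('1a','1b'), ('1b','1c'), ('2a','2b'), ('2b','2c'), ('2d','2a');

-- One row per distinct customer number; each number starts as its own group label.
CREATE TABLE #Label (
    CustNum varchar(16) NOT NULL PRIMARY KEY,
    Label   varchar(16) NOT NULL
);
INSERT INTO #Label (CustNum, Label)
SELECT v, v
FROM  (SELECT A AS v FROM #DupePairs
       UNION
       SELECT B FROM #DupePairs) AS n;

-- Each pass lowers a number's label to the smallest label held by any neighbour.
-- The pairs are read in both directions, so column order in the source is irrelevant.
WHILE 1 = 1
BEGIN
    UPDATE l
    SET    Label = m.MinLabel
    FROM   #Label AS l
    JOIN  (SELECT e.X, MIN(nb.Label) AS MinLabel
           FROM  (SELECT A AS X, B AS Y FROM #DupePairs
                  UNION ALL
                  SELECT B, A FROM #DupePairs) AS e
           JOIN   #Label AS nb ON nb.CustNum = e.Y
           GROUP BY e.X) AS m
           ON m.X = l.CustNum
    WHERE  m.MinLabel < l.Label;

    -- Stop once a full pass changes nothing.
    IF @@ROWCOUNT = 0 BREAK;
END;

-- Turn the surviving labels into consecutive integer keys.
SELECT DENSE_RANK() OVER (ORDER BY Label) AS MasterKey, CustNum
FROM   #Label
ORDER BY MasterKey, CustNum;
```

For the sample rows this should group 1a, 1b, 1c under one key and 2a, 2b, 2c, 2d under another, matching the example in the question. The number of passes grows with the longest chain of pairs, so a long chain in the real data means more iterations.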
Anyway, given this target table from my question:
MasterKey int NOT NULL,
CustNum varchar(16) NOT NULL UNIQUE,
PRIMARY KEY( MasterKey, CustNum )
I came up with the following stored procedure. It can be called when new dupes are reported, one by one. It can also be called in a loop over the legacy table that stores the dupes as pairs in a random order.
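Separately from the author's procedure, a generic sketch of pair-at-a-time merging might look like the following. It is an illustration under assumptions, not the original code: the names `dbo.CustMaster` and `dbo.MergeDupePair`, the parameter names, and the MAX + 1 key allocation are all hypothetical, and there is no transaction or locking, so it is not safe for concurrent callers as written.

```sql
-- Hypothetical target table, following the column list in the question.
CREATE TABLE dbo.CustMaster (
    MasterKey int         NOT NULL,
    CustNum   varchar(16) NOT NULL UNIQUE,
    PRIMARY KEY (MasterKey, CustNum)
);
GO  -- batch separator understood by SSMS/sqlcmd, not a T-SQL statement

-- Hypothetical procedure: records that @A and @B belong to the same customer.
CREATE PROCEDURE dbo.MergeDupePair
    @A varchar(16),
    @B varchar(16)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @KeyA int, @KeyB int;

    -- Each variable stays NULL when its number is not in the table yet.
    SELECT @KeyA = MasterKey FROM dbo.CustMaster WHERE CustNum = @A;
    SELECT @KeyB = MasterKey FROM dbo.CustMaster WHERE CustNum = @B;

    IF @KeyA IS NULL AND @KeyB IS NULL
    BEGIN
        -- Neither number is known: start a new group.
        SELECT @KeyA = COALESCE(MAX(MasterKey), 0) + 1 FROM dbo.CustMaster;
        INSERT INTO dbo.CustMaster (MasterKey, CustNum) VALUES (@KeyA, @A);
        IF @B <> @A
            INSERT INTO dbo.CustMaster (MasterKey, CustNum) VALUES (@KeyA, @B);
    END
    ELSE IF @KeyB IS NULL
        -- Only A is known: B joins A's group.
        INSERT INTO dbo.CustMaster (MasterKey, CustNum) VALUES (@KeyA, @B);
    ELSE IF @KeyA IS NULL
        -- Only B is known: A joins B's group.
        INSERT INTO dbo.CustMaster (MasterKey, CustNum) VALUES (@KeyB, @A);
    ELSE IF @KeyA <> @KeyB
        -- Both are known but in different groups: fold B's group into A's.
        UPDATE dbo.CustMaster SET MasterKey = @KeyA WHERE MasterKey = @KeyB;
    -- Otherwise both numbers already share a key and there is nothing to do.
END;
GO

-- Feeding the sample pairs from the question one at a time.
EXEC dbo.MergeDupePair '1a', '1b';
EXEC dbo.MergeDupePair '1b', '1c';
EXEC dbo.MergeDupePair '2a', '2b';
EXEC dbo.MergeDupePair '2b', '2c';
EXEC dbo.MergeDupePair '2d', '2a';

SELECT MasterKey, CustNum FROM dbo.CustMaster ORDER BY MasterKey, CustNum;
```

Because each call handles all four cases (neither number known, one known, both known in different groups, both already together), the order in which pairs arrive should not change which numbers end up grouped, only which MasterKey value a group keeps.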