I am having trouble developing a matching algorithm in SQL. I have one table subjects. Each of these needs to be matched to the same number of rows in the table controls (for the sake of this question let’s say two rows or controls need to be selected for each subject). The location of the selected controls must match exactly, and the controls selected should have a value in match_field that is as close as possible to the subjects.
Here is some sample data:
Table subjects:
id location match_field
1 1 190
2 2 2000
3 1 100
Table controls:
id location match_field
17 1 70
11 1 180
12 1 220
13 1 240
14 1 500
15 1 600
16 1 600
10 2 30
78 2 1840
79 2 2250
Here would be the optimum result from the sample data:
subject_id control_id location match_field_diff
1 12 1 30
1 13 1 50
2 78 2 160
2 79 2 250
3 17 1 30
3 11 1 80
It gets tricky, because, for example, control 11 is the closest match to subject 1. However, in the optimum solution control 11 is matched to subject 3.
I believe the Hungarian Algorithm is close to the “correct” solution to this problem. However, there is not an equal number of subjects and controls, nor will all controls be used (I have a few thousand subjects and a few million potential controls).
It is not necessary to obtain the absolute optimum results; a pretty good approximation would be fine with me.
It seems that there should be a nice set-based solution to this problem, but I can’t think of how to do it. Here is some code that assigns an equal number of controls to each subject based on location only:
select * from (
select subject.id,
control.id,
subject.location,
row_number() over (
partition by subject.location
order by subject.id, control.id
) as rn,
count(distinct control.id) over (
partition by subject.location
) as controls_in_loc
from subjects
join controls on control.location = subject.location
)
where mod(rn,controls_in_loc+1) = 1
However, I can’t figure out how to add the fuzzy matching component. I am using DB2 but can convert an algorithm into DB2 if you are using something else.
Thanks in advance for your help!
Update: I am mostly convinced that SQL is not the right tool for this job. However, just to be sure (and because it is an interesting problem), I am offering a bounty to see if a working SQL solution is possible. It needs to be a set-based solution. It can use iteration (looping over the same query multiple times to achieve the result) but number of iterations needs to be far less than the number of rows for a large table. It should not loop over each element in the table or use cursors.
Although the Hungarian Algorithm is going to work, a much simpler algorithm can be used in your case. Your implicit cost matrix is a symmetric matrix of a special form:
Therefore, you can relatively easily prove that in an optimal assignment {SUBJi, CTRLj} ordered by
SUBJ.match_fieldthe values ofCTRL.match_fieldwill be ordered as well.Proof: Consider an assignment {SUBJi, CTRLj} ordered by
SUBJ.match_fieldthat is not ordered byCTRL.match_field. Then you have at least one inversion, i.e. a pair of assignments {SUBJi1, CTRLj1} and {SUBJi2, CTRLj2} such thatSUBJ.match_fieldi1 <SUBJ.match_fieldi2, butCTRL.match_fieldj1 >CTRL.match_fieldj2Then you can replace the inverted pair with a non-inverted one
{SUBJi1, CTRLj2} and {SUBJi2, CTRLj1}
of a cost that is less than or equal to the cost of the inverted assignment for all six relative placements of
SUBJ.match_field(i1, i2) andCTRL.match_field(j1, j2) (link to Wolfram Alpha). :ProofWith this observation in hand, it is easy to prove that the dynamic programming algorithm below comes up with the optimal assignment:
Nduplicates of each subject; order bymatch_fieldmatch_fieldassignmentsof sizeN * subject.SIZEmemof sizeN * subject.SIZEbycontrol.SIZEfor memoization; set all elements to-1Recursive_Assigndefined in pseudocode belowassignmentstable now containsNassignments for each subjectiat positions betweenN*i, inclusive, andN*(i+1), exclusive.Here is an implementation of the above pseudocode using C# on ideone.
This algorithm is ready to be re-written as set-based in SQL. Trying to fit it into the original problem setting (with grouping by locations and making multiple copies of the subject) would add unnecessary layer of complexity to a procedure that is already rather complex, so I am going to simplify things quite a bit by using table-valued parameters of SQL Server. I am not sure if DB2 provides similar capabilities, but if it does not, you should be able to replace them with temporary tables.
The stored procedure below is a nearly direct transcription of the above pseudocode into SQL Server’s syntax for stored procedures:
Here is how you call this stored procedure:
The call above assumes that both
subjectsandcontrolshave been filtered by location, and thatNcopies ofsubjectshas been inserted into the table-valued parameter (or the temp table in case of DB2) before making the call.Here is a running demo on sqlfiddle.