I have two sets of data in this form:
x   | y | z        x1  | y1  | z1
ab1 | 1 | 2        ab1 | 1   | 2
ab1 | 2 | 3        ab1 | 1.8 | 2
ab2 | 2 | 3        ab1 | 1.8 | 2
The number of columns can vary from 1 to 30. The two sets are likely to have different numbers of rows.
The number of rows per set can range from a few hundred to a few million.
For each column a different matching rule will be applied, for example:
x: perfect match
y: +/- 0.1
z: +/- 0.5
Two rows are equivalent when all the criteria are satisfied.
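To make the rule concrete, here is a minimal sketch of such a row comparison in Python. The names `RULES`, `exact`, `within` and `rows_match` are made up for illustration, and the tolerances are simply the example values listed above; the same structure ports directly to C++, C# or Java.

```python
# Hypothetical helpers: one comparison rule per column, applied position by position.
def exact(a, b):
    """Perfect match."""
    return a == b

def within(tol):
    """Build a rule that accepts values differing by at most tol."""
    def rule(a, b):
        return abs(a - b) <= tol
    return rule

# Example rules from the question: x exact, y +/- 0.1, z +/- 0.5.
RULES = [exact, within(0.1), within(0.5)]

def rows_match(row_a, row_b, rules=RULES):
    """Two rows are equivalent only when every column rule is satisfied."""
    return all(rule(a, b) for rule, a, b in zip(rules, row_a, row_b))

print(rows_match(("ab1", 1, 2), ("ab1", 1, 2)))    # True
print(rows_match(("ab1", 1, 2), ("ab1", 1.8, 2)))  # False: y differs by 0.8
```

Because the tolerances are compared in floating point, values that sit exactly on a boundary (for example a difference of exactly 0.1) may go either way; that is a property of this sketch, not something the question specifies.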
My final goal is to find the rows in the first set that have no match in the second set.
The naive algorithm could be:
foreach a in SetA
{
    matched = false
    foreach b in SetB
    {
        if (a matches b)
        {
            remove b from SetB
            matched = true
            break    // process the next element in SetA
        }
    }
    if (not matched)
        log a is not in SetB
}
At this stage I am not very interested in the efficiency of the algorithm. I am sure I could do better and reduce the complexity.
I am more concerned about the correctness of the result. Let’s try a very simple example.
Two sets of numbers:
A B
1.6 1.55
1.5 1.45
4 3.2
And two elements are equal if:
b + 0.1 >= a >= b - 0.1
Now, if I run the naive algorithm I will find 2 matches.
However, the result of the algorithm depends on the order of the elements in the two sets. For example:
A B
1.5 1.55
1.6 1.45
4 3.2
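The algorithm will find only one match: 1.5 is paired with 1.55, which leaves 1.6 with no partner within 0.1, even though pairing 1.5 with 1.45 and 1.6 with 1.55 would give two matches.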
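The order dependence can be reproduced with a short Python sketch of the naive scan. The function name `greedy_match_count` and the `tol` parameter are chosen here for illustration; the data are the two orderings shown above.

```python
def greedy_match_count(set_a, set_b, tol=0.1):
    """Naive scan: each a takes the first remaining b within tol."""
    remaining = list(set_b)          # copy so the caller's list is not modified
    matches = 0
    for a in set_a:
        for i, b in enumerate(remaining):
            if abs(a - b) <= tol:
                del remaining[i]     # b is consumed by this match
                matches += 1
                break                # move on to the next element of set_a
    return matches

print(greedy_match_count([1.6, 1.5, 4], [1.55, 1.45, 3.2]))  # 2
print(greedy_match_count([1.5, 1.6, 4], [1.55, 1.45, 3.2]))  # 1
```

The same elements give different counts, so a first-come-first-served scan does not in general produce the maximum number of matches.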
I would like to find the maximum number of matching rows.
I expect that in real-world data one of the columns will store an id, so the rows that can take part in multiple matches will be a much smaller subset of the original set.
I know I could try to handle this problem with a post-processing step after the first scan.
However, I don’t want to reinvent the wheel, and I am wondering whether my problem is equivalent to some well-known, already solved problem.
PS: I have tagged the question also as C++, C# and Java because I am going to use one of these languages to implement it.
It can be cast as a graph theory problem. Let X be a set that contains one node for each row in your first set. Let Y be another set which contains one node for each row in your second set.
The edges in the graph are defined by: for a given x in X and a given y in Y, there is an edge (x,y) if the row corresponding to x matches the row corresponding to y.
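Once you have built this graph, you can run a maximum bipartite matching algorithm on it (Hopcroft-Karp, for example). The nodes of X left unmatched correspond to the rows of the first set that have no match in the second set.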
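As an illustration of this approach, here is a minimal Python sketch using the simple augmenting-path method (often called Kuhn's algorithm) rather than a tuned implementation. The function name `max_bipartite_matching` and the `matches` callback are assumptions made for the example; the edge list is built by comparing every pair, which is only reasonable for small inputs.

```python
def max_bipartite_matching(set_a, set_b, matches):
    """Return (number of matched pairs, elements of set_a left unmatched)."""
    # Edges: for each index i in set_a, the indices j in set_b that it matches.
    adj = [[j for j, b in enumerate(set_b) if matches(a, b)] for a in set_a]
    match_of_b = [None] * len(set_b)   # match_of_b[j] = index in set_a, or None

    def try_augment(i, seen):
        """Try to match i, re-assigning earlier matches if that frees a partner."""
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_of_b[j] is None or try_augment(match_of_b[j], seen):
                match_of_b[j] = i
                return True
        return False

    for i in range(len(set_a)):
        try_augment(i, set())

    matched_a = {i for i in match_of_b if i is not None}
    unmatched = [a for i, a in enumerate(set_a) if i not in matched_a]
    return len(matched_a), unmatched

def close(a, b):
    return abs(a - b) <= 0.1

print(max_bipartite_matching([1.5, 1.6, 4], [1.55, 1.45, 3.2], close))  # (2, [4])
```

On the ordering that defeated the naive scan, this finds both matches and reports only 4 as unmatched. For sets with millions of rows, the all-pairs edge construction and the recursion would both need replacing, for instance by grouping rows on an exact-match column first and using an iterative Hopcroft-Karp implementation on each group.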