I have to create unique combinations while allowing some mismatches. The following is an example:
set.seed(1234)
dataf <- data.frame(var1 = sample(c("A", "B", "-"), 20, replace = TRUE),
                    var2 = sample(c("A"), 20, replace = TRUE),
                    var3 = sample(c("B", "B", "B", "-"), 20, replace = TRUE),
                    var4 = sample(c("A", "A", "A", "-"), 20, replace = TRUE),
                    var5 = sample(c("A", "B", "A", "A", "-"), 20, replace = TRUE))
dataf
Rules:
(1) Generate Unique combinations:
A B A B B - combination 1
A A A B B - combination 2
B B B A A - combination 3
so on ...
(2) Allow one mismatch (or, more generally, n mismatches) when assigning a category. For example:
A B A B B
A A A B B
B A A B B
B A B B B
B A A B A
are the same, as there is a single mismatch at different variables.
(3) “-” indicates a missing value. It can be treated in a similar way when matching, meaning that one such mismatch is allowed:
A B A B B
A - A B B
A B A - B
However, if there are two missing values, then the combination is declared unknown (-):
A B A B B
A - A - B
A B A - -
The following is the worked-out result for the above data:
var1 var2 var3 var4 var5 comb
1 A A B - - -
2 B A B A A 1
3 B A B A A 1
4 B A B A A 1
5 - A B A A 1
6 B A B A - 1
7 A A B A B 2
8 A A B A B 2
9 B A B A A 1
10 B A - A - -
11 - A B A A 1
12 B A B - - -
13 A A B A A 2
14 - A B - A -
15 A A B A A 2
16 - A B A A 2
17 A A B A B 2
18 A A - A A 3
19 A A B A B 2
20 A A - A A 3
Any ideas?
Here is how I would do it. The idea is to create a distance matrix, so that you can cluster your data into groups of rows that have a zero distance among them.
First, let’s remove (temporarily) the rows that have two or more dashes:
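Something along these lines should work. This is only a sketch on a small made-up data frame (`toy`, `n.dash`, `is.valid` and `toy.valid` are names chosen for illustration, not your actual `dataf`):

```r
# Hypothetical toy data with the same column layout as `dataf`
toy <- data.frame(var1 = c("A", "B", "-"),
                  var2 = c("A", "A", "A"),
                  var3 = c("B", "-", "B"),
                  var4 = c("-", "A", "A"),
                  var5 = c("-", "A", "A"),
                  stringsAsFactors = FALSE)

# Count the dashes in each row
n.dash <- rowSums(toy == "-")

# Keep only the rows with fewer than two dashes
is.valid  <- n.dash < 2
toy.valid <- toy[is.valid, ]
```

Keeping `is.valid` around makes it easy to put the dropped rows back later and mark them as unknown.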
Then, let’s compute a distance matrix.
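As a sketch, and assuming the distance between two rows is the number of positions where both rows have a known value and those values differ (so a dash never counts as a mismatch), it could look like this. The function name `row.dist` and the three-row matrix `toy` are made up for illustration:

```r
# Assumed distance: count positions where both values are known and differ
row.dist <- function(x, y) {
  known <- x != "-" & y != "-"
  sum(x[known] != y[known])
}

# Hypothetical toy rows, one row per observation
toy <- rbind(c("B", "A", "B", "A", "A"),
             c("-", "A", "B", "A", "A"),
             c("A", "A", "B", "A", "B"))

# Fill the full pairwise distance matrix
n <- nrow(toy)
dist.mat <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist.mat[i, j] <- row.dist(toy[i, ], toy[j, ])
  }
}
dist.mat
```

Note that a zero distance under this definition is not transitive: the row with a dash is at distance 0 from the first row and at distance 1 from the third, while rows one and three are at distance 2.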
Then, let’s use clustering to group your data. I am using a complete hierarchical clustering and cutting it at height = 0, i.e., it creates groups of points that all have a distance of zero among them.
Let’s put everything together:
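A minimal end-to-end sketch follows. To keep it reproducible, the rows are copied from the worked table in the question rather than regenerated with `sample()`. The distance function `row.dist` is an assumed definition (dashes never count as mismatches), and unknown rows are marked with `NA` rather than "-":

```r
# Rows copied from the worked table in the question (var1..var5)
rows <- c("AAB--", "BABAA", "BABAA", "BABAA", "-ABAA", "BABA-", "AABAB",
          "AABAB", "BABAA", "BA-A-", "-ABAA", "BAB--", "AABAA", "-AB-A",
          "AABAA", "-ABAA", "AABAB", "AA-AA", "AABAB", "AA-AA")
m <- do.call(rbind, strsplit(rows, ""))
colnames(m) <- paste0("var", 1:5)

# Step 1: set aside rows with two or more dashes
is.valid <- rowSums(m == "-") < 2
valid <- m[is.valid, , drop = FALSE]

# Step 2: pairwise distances (assumed: mismatches among known values only)
row.dist <- function(x, y) {
  known <- x != "-" & y != "-"
  sum(x[known] != y[known])
}
n <- nrow(valid)
dist.mat <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist.mat[i, j] <- row.dist(valid[i, ], valid[j, ])
  }
}

# Step 3: complete-linkage clustering, cut at height 0
hc <- hclust(as.dist(dist.mat), method = "complete")
groups <- cutree(hc, h = 0)

# Put the group labels back; rows that were set aside stay NA (unknown)
comb <- rep(NA_integer_, nrow(m))
comb[is.valid] <- groups
result <- data.frame(m, comb = comb)
result
```

Complete linkage matters here: it only merges two clusters when every pair of rows across them is at distance zero, so a row with a dash cannot glue two incompatible groups together. Which group such a row ends up in depends on how ties are resolved.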
This exposes contradictions in your expected output. For example, rows 7 and 13 should not belong to the same group. Also, there are rows with a single dash that could go to different groups, e.g. row 16.