I want to subset a data frame that has an ID column (v1, all unique) and a “linked” ID column (v2). v2 may contain NAs, but where it does, the corresponding element of v1 must not appear elsewhere in v2. The relation between the columns is also expected to be symmetric: where a row has v1 = y and v2 = x, there must be another row with v1 = x and v2 = y. The last criterion is that the relation is not reflexive, i.e. x != y.
I want to subset the dataframe to the elements which don’t fit the expected criteria.
Here is some sample data to illustrate:
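set.seed(1)
# Start from a fully symmetric pairing: a <-> z, b <-> y, ...
dfr <- data.frame(v1 = letters, v2 = rev(letters))
# Blank out 10 random links
dfr[sample(26, 10), 2] <- NA
# Overwrite 5 random links with random letters
dfr[sample(26, 5), 2] <- sample(letters, 5)
dfr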
v1 v2
1 a z
2 b <NA>
3 c x
4 d w
5 e <NA>
6 f u
7 g <NA>
8 h s
9 i i
10 j <NA>
11 k p
12 l <NA>
13 m f
14 n <NA>
15 o l
16 p k
17 q j
18 r e
19 s <NA>
20 t g
21 u <NA>
22 v e
23 w <NA>
24 x q
25 y x
26 z a
So rows 1, 2, 11, 14, 16, and 26 all meet the criteria, and I want to identify the rest.
I have attempted some solutions using match, but the NAs are causing problems. My attempts also probably rely on the fact that in this case v2 is based on rev(v1), whereas a more general solution can’t make that assumption.
If I understand correctly, here is an example:
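A minimal sketch of one way to check the three criteria with match while keeping the NAs under control. It assumes v1 is unique and that both columns are character vectors. The sample data is typed in by hand from the printed table, since sample() results for a given seed can differ between R versions, and the names partner, ok_na, ok_link and bad are illustrative only.

```r
# Sample data copied from the printed table in the question
dfr <- data.frame(
  v1 = letters,
  v2 = c("z", NA, "x", "w", NA, "u", NA, "s", "i", NA, "p", NA, "f",
         NA, "l", "k", "j", "e", NA, "g", NA, "e", NA, "q", "x", "a"),
  stringsAsFactors = FALSE
)

# For each row, find the row whose v1 equals this row's v2 and take its v2.
# This is NA when v2 is NA or when the partner row has no link of its own.
partner <- dfr$v2[match(dfr$v2, dfr$v1)]

is_na <- is.na(dfr$v2)

# Criterion 1: an unlinked ID must not be referenced anywhere in v2
ok_na <- is_na & !(dfr$v1 %in% dfr$v2)

# Criteria 2 and 3: the link is mirrored by the partner row and is not reflexive.
# !is.na(partner) comes before the comparison so NA comparisons resolve to FALSE.
ok_link <- !is_na & !is.na(partner) & partner == dfr$v1 & dfr$v2 != dfr$v1

ok  <- ok_na | ok_link
bad <- dfr[!ok, ]   # rows that do not fit the expected criteria

which(ok)
```

On the sample data, which(ok) should give rows 1, 2, 11, 14, 16 and 26, so bad holds the remaining 20 rows. Nothing here depends on v2 having been built from rev(v1); the lookup goes through match on the values themselves.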