I’m trying to join two datasets, call them x and y. I believe the IDs in y are roughly a subset of the IDs in x, though not strictly: I know that x contains more IDs than y, but I don’t know the mapping between them. That is, some (but not all) of the IDs in x and y can be matched 1:1.
My ultimate goal is to figure out where this 1:1 mapping fails and flag these observations. I thought merge would be the way to go but maybe not. An example is below:
id <- c(1:10, 1:100)
X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002")
year <- rep(year, 22)
month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)
#dataset X
x <- cbind(id, X1, month, year)
#dataset Y
id2 <- c(1:10, 200)
Y1 <- rnorm(11, mean = 0 , sd = 1)
y <- cbind(id2,Y1)
#merge on the IDs; but we get an error because when id2 == 200 in y we don't
#have a match in x
result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE)
The merge threw an error because id2 == 200 had no match in the x dataset. Unfortunately, I lost the ID and all the information as well! (it should equal 200 in row 111):
tail(result)
id X1 month year Y1
106 95 -0.0748386054887876 Nov 2002 NA
107 96 0.196765325477989 Dec 2004 NA
108 97 0.527922135906927 Jan 2005 NA
109 98 0.197927230533413 Feb 2006 NA
110 99 -0.00720474886698309 Mar 2001 NA
111 <NA> <NA> <NA> <NA> -0.9664941
What’s more, I get duplicate observations on the ID variable in the merged file. The id2 == 1 observation only existed once but it just copied it twice (e.g. Y1 takes on the value 1.55 twice).
head(result)
id X1 month year Y1
1 1 -0.67371266313441 Jul 2004 1.553220
2 1 -0.318666983469993 Jul 2004 1.553220
3 10 -0.608192898092431 Apr 2002 1.234325
4 10 -0.72299929212347 Apr 2002 1.234325
5 100 -0.842111221826554 Apr 2002 NA
6 11 -0.16316681842082 Jul 2004 NA
This merge has made things more complicated than I intended. I was hoping I could examine every observation in x and figure out where the id matched id2 in y and flag the ones that didn’t. So I would get a new vector, call it flag, that takes on a value 1 if x$id had a match in y$id2 and zero otherwise. This way, I could know where the 1:1 mapping failed. I could potentially get some traction on this by re-coding the NAs, but what about the error that gets thrown when id2 == 200? It just discards the information.
I have tried appending by rows with no luck, and it looks like I should give up on merge as well. Perhaps it’s better to write a loop or function to do something along these lines:
for every observation in x
id2 = which(id2) corresponds to id-month-year
flag = 1 if length of above is == 1, 0 otherwise
etc.
Hopefully this all makes sense. I’d be very grateful for any help or guidance.
If you are looking for which values of x$id are in y$id2, then you can use the %in% operator (x$id %in% y$id2) to get a logical vector of matches. It does not guarantee a 1-to-1 correspondence, however; just a 1-to-many. You can then add this vector to your data frame as a new column to see which rows of x have a corresponding ID in y.
To see which observations are 1-to-1, you could additionally filter out elements that appear more than once in y$id2, for example with duplicated(). You can also add this result to x as a column.
The same procedure can be done for y to determine which rows of y match in x, and which ones match uniquely.
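A minimal sketch of the approach described above, using the data from the question. It assumes x and y are built with data.frame() rather than cbind(), so that id stays numeric instead of being coerced to character. The column names in_y, in_y_unique, in_x and in_x_unique are hypothetical names chosen for illustration, and set.seed() is only there to make the random columns reproducible.

```r
set.seed(1)  # assumption: fixed seed, only so X1 and Y1 are reproducible

# Rebuild the question's data as data frames (assumption: data.frame()
# instead of cbind(), which would turn every column into character).
x <- data.frame(
  id    = c(1:10, 1:100),
  X1    = rnorm(110),
  month = rep(c("Jul", "Aug", "Sep", "Oct", "Nov",
                "Dec", "Jan", "Feb", "Mar", "Apr"), 11),
  year  = rep(c("2004", "2005", "2006", "2001", "2002"), 22)
)
y <- data.frame(id2 = c(1:10, 200), Y1 = rnorm(11))

# Flag rows of x whose id appears in y$id2 (1 = match, 0 = no match).
# This is only a 1-to-many check.
x$in_y <- as.integer(x$id %in% y$id2)

# Stricter flag: the id matches AND is not duplicated within y$id2.
dup_y <- y$id2[duplicated(y$id2)]
x$in_y_unique <- as.integer(x$id %in% y$id2 & !(x$id %in% dup_y))

# The same procedure from y's side.
dup_x <- x$id[duplicated(x$id)]
y$in_x <- as.integer(y$id2 %in% x$id)
y$in_x_unique <- as.integer(y$id2 %in% x$id & !(y$id2 %in% dup_x))

# IDs in y with no partner in x; nothing is discarded.
y$id2[y$in_x == 0]  # 200
```

With this data, 20 rows of x are flagged as having a match in y (ids 1 to 10 each occur twice in x), and no row of y matches uniquely, because every matching id is duplicated in x. Because the flags are computed with %in% rather than merge, neither data frame gains or loses rows.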