I’m trying to join two datasets together. Call them x and y. I believe

Question

Asked: June 16, 2026 (2026-06-16T21:26:11+00:00)

I’m trying to join two datasets together. Call them x and y. I believe that the ID variables in y are a subset of the ID variables in x, but not in the strict sense: I know that x contains more IDs than y, but I don’t know the mapping. That is, some (but not all) of the IDs in x and y can be matched 1:1.

My ultimate goal is to figure out where this 1:1 mapping fails and flag these observations. I thought merge would be the way to go but maybe not. An example is below:

id <- c(1:10, 1:100)

X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002") 
year <- rep(year, 22)

month <- c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)

#dataset X
x <- cbind(id, X1, month, year)

#dataset Y
id2 <- c(1:10, 200)
Y1 <- rnorm(11, mean = 0, sd = 1)
y <- cbind(id2,Y1)

#merge on the IDs; but we get an error because when id2 == 200 in y we don't 
#have a match in x 
result <- merge(x, y, by.x = "id", by.y = "id2", all = TRUE)

The merge threw an error because id2 == 200 had no match in the x dataset. Unfortunately, I lost the ID and all the information as well! (it should equal 200 in row 111):

tail(result) 
      id                   X1 month year         Y1
106   95  -0.0748386054887876   Nov 2002         NA
107   96    0.196765325477989   Dec 2004         NA
108   97    0.527922135906927   Jan 2005         NA
109   98    0.197927230533413   Feb 2006         NA
110   99 -0.00720474886698309   Mar 2001         NA
111 <NA>                 <NA>  <NA> <NA> -0.9664941

What’s more, I get duplicate observations on the ID variable in the merged file. The id2 == 1 observation only existed once but it just copied it twice (e.g. Y1 takes on the value 1.55 twice).

head(result)
   id                 X1 month year       Y1
1   1  -0.67371266313441   Jul 2004 1.553220
2   1 -0.318666983469993   Jul 2004 1.553220
3  10 -0.608192898092431   Apr 2002 1.234325
4  10  -0.72299929212347   Apr 2002 1.234325
5 100 -0.842111221826554   Apr 2002       NA
6  11  -0.16316681842082   Jul 2004       NA

This merge has made things more complicated than I intended. I was hoping I could examine every observation in x and figure out where the id matched id2 in y and flag the ones that didn’t. So I would get a new vector, call it flag, that takes on a value 1 if x$id had a match in y$id2 and zero otherwise. This way, I could know where the 1:1 mapping failed. I could potentially get some traction on this by re-coding the NAs, but what about the error that gets thrown when id2 == 200? It just discards the information.

I have tried appending by rows with no luck, and it looks like I should give up on merge as well. Perhaps it’s better to write a loop or function to do something along these lines:

for every observation in x

id2 = which(id2) corresponds to id-month-year

flag = 1 if length of above is == 1, 0 otherwise

etc.

Hopefully this all makes sense. I’d be very grateful for any help or guidance.

1 Answer

Editorial Team · Answer 1 · 2026-06-16T21:26:12+00:00

If you want to know which values of x$id also appear in y$id2, you can use

x$id %in% y$id2

to get a logical vector that is TRUE wherever an ID in x has at least one match in y. It does not guarantee a 1-to-1 correspondence, however; the match may be 1-to-many. You can then add this vector to your data frame

x$match.y <- x$id %in% y$id2

to see what rows of x have a corresponding ID in y.
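
Note that $ only works when x and y are data frames. The question builds them with cbind(), which returns matrices (a character matrix for x, since numeric and character columns are mixed), and $ fails on those. A minimal sketch of the flag, assuming the question's data is rebuilt with data.frame() and that a 0/1 column called flag (the name used in the question's description) is what is wanted:

```r
set.seed(1)  # arbitrary seed, only so the random columns are reproducible

# Rebuild the question's data as data frames instead of cbind() matrices
x <- data.frame(
  id    = c(1:10, 1:100),
  X1    = rnorm(110),
  month = rep(c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr"), 11),
  year  = rep(c("2004","2005","2006","2001","2002"), 22)
)
y <- data.frame(id2 = c(1:10, 200), Y1 = rnorm(11))

# 1 if the row's id occurs anywhere in y$id2, 0 otherwise
x$flag <- as.integer(x$id %in% y$id2)

# Count of unflagged (0) and flagged (1) rows
table(x$flag)
```

With this data, the 20 rows of x whose id is between 1 and 10 get flag 1 and the remaining 90 rows get flag 0.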

To see which observations match an ID that occurs only once in y, you could do something like

y$id2[duplicated(y$id2)] #vector of duplicate elements in y$id2
(x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])

to filter out elements that appear more than once in y$id2. You can also add this to x:

x$match.y.unique <- (x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
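
The check above only rules out IDs that are repeated in y$id2. If a strict 1-to-1 match is wanted, IDs repeated in x$id probably need to be excluded as well (in the question's example, IDs 1 to 10 each occur twice in x). A sketch under that assumption, using small made-up data and a hypothetical helper named occurs_once():

```r
# Hypothetical helper: TRUE for values that occur exactly once in v
occurs_once <- function(v) !(duplicated(v) | duplicated(v, fromLast = TRUE))

# Small illustrative data: id 1 is repeated in x, id 3 is repeated in y
x <- data.frame(id = c(1, 1, 2, 3, 4))
y <- data.frame(id2 = c(1, 2, 3, 3, 200))

# At least one match in y
x$match.y <- x$id %in% y$id2

# Strict 1-to-1: the id is unique in x AND appears exactly once in y
x$match.one.to.one <- occurs_once(x$id) & (x$id %in% y$id2[occurs_once(y$id2)])

x
```

Here only the row with id 2 is TRUE in match.one.to.one: id 1 is duplicated in x, id 3 is duplicated in y, and id 4 has no match at all.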

The same procedure can be done for y to determine what rows of y match in x, and which ones match uniquely.
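
For the y side, that could look like the sketch below, which flags the rows of y with no counterpart in x; this is how a row such as id2 == 200 from the question can be found. The data is made up for illustration and match.x is a hypothetical column name. The sketch also shows that, when both inputs are data frames with numeric ID columns, merge(..., all = TRUE) keeps the unmatched ID in the result:

```r
x <- data.frame(id = c(1, 1, 2, 3), X1 = c(0.1, 0.2, 0.3, 0.4))
y <- data.frame(id2 = c(1, 2, 200), Y1 = c(10, 20, 30))

# Rows of y whose ID never appears in x
y$match.x <- y$id2 %in% x$id
y[!y$match.x, ]

# Full outer join: unmatched rows from either side are kept, padded with NA
result <- merge(x, y, by.x = "id", by.y = "id2", all = TRUE)
result
```

The row for id 1 appears twice in result because x has two rows with that ID; merge pairs every matching row of x with every matching row of y, so repeated rows after a merge are themselves a sign that the IDs are not 1-to-1.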
