I have three data frames, and I want to add columns to the first one that count how many times each pair of values in its first two columns appears in each of the other data frames, e.g.
dataframe – x
a b
1 1
1 2
2 1
2 2
dataframe – y
a b
1 1
1 1
1 2
2 2
2 2
dataframe – z
a b
1 2
2 1
2 1
2 2
So the first dataframe would become
a b y z
1 1 2 0
1 2 1 1
2 1 0 2
2 2 2 1
I have ways to do this, e.g. I am currently doing
x$y <- sapply(seq_len(nrow(x)), function(i) {
  sum(y$a == x$a[i] & y$b == x$b[i])
})
x$z <- sapply(seq_len(nrow(x)), function(i) {
  sum(z$a == x$a[i] & z$b == x$b[i])
})
But my data frames are very large and this approach takes a while to complete, so I was wondering what the quickest way to do this is.
Please ask if anything is unclear.
Thanks in advance
To avoid the double loop, I would use the function match, which is optimized for finding the elements of one vector in another. To count the occurrences, I propose tabulating the variables first and then matching against the table.
My guess is that this would significantly reduce the running time: the method you propose is quadratic (one loop goes over the rows of x, and for each of them the comparison scans every row of y), whereas the functions match and table are based on sorting or hashing (I think), which is closer to n*log(n).
We first turn the data frames into vectors with paste, taken from the answer of Josh:
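A minimal sketch of this step on the example data from the question. The "_" separator and the explicit column names are assumptions of this sketch, and the separator must not occur in the values themselves:

```r
# Example data from the question
x <- data.frame(a = c(1, 1, 2, 2), b = c(1, 2, 1, 2))
y <- data.frame(a = c(1, 1, 1, 2, 2), b = c(1, 1, 2, 2, 2))
z <- data.frame(a = c(1, 2, 2, 2), b = c(2, 1, 1, 2))

# Collapse each (a, b) row into a single character key
X <- paste(x$a, x$b, sep = "_")
Y <- paste(y$a, y$b, sep = "_")
Z <- paste(z$a, z$b, sep = "_")
```

Naming the columns explicitly keeps the keys unchanged once extra columns such as y and z are added to x.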
Then we tabulate and match against the tabulation.
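A self-contained sketch of the whole approach on the question's example data. The "_" separator and the final step that turns NA into 0 are assumptions of this sketch, not necessarily what the original code did:

```r
# Example data from the question
x <- data.frame(a = c(1, 1, 2, 2), b = c(1, 2, 1, 2))
y <- data.frame(a = c(1, 1, 1, 2, 2), b = c(1, 1, 2, 2, 2))
z <- data.frame(a = c(1, 2, 2, 2), b = c(2, 1, 1, 2))

# One character key per row (the "_" separator is an assumption)
X <- paste(x$a, x$b, sep = "_")
Y <- paste(y$a, y$b, sep = "_")
Z <- paste(z$a, z$b, sep = "_")

# Look up each key of x in the table of counts;
# keys that never occur in the other data frame give NA
x$y <- as.vector(table(Y))[match(X, names(table(Y)))]
x$z <- as.vector(table(Z))[match(X, names(table(Z)))]

# Pairs that are absent should count as 0 rather than NA
x$y[is.na(x$y)] <- 0L
x$z[is.na(x$z)] <- 0L
```

On the example this gives y = 2, 1, 0, 2 and z = 0, 1, 2, 1, which is the result the question asks for.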
You could store table(Y) in an intermediate variable if you want to avoid tabulating twice.