Let’s say a and b are two data frames. The goal is to write a function
f(a,b) that produces a merged data frame in the same way as
merge(a,b,all=TRUE) would, that is, filling missing variables in a or b with NAs. (The problem is that merge() appears to be very slow.)
This can be done as follows (pseudo-code):
for each variable `var` found in either `a` or `b`, do:
    unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)
where, for x one of `a`, `b` and y the other data frame:
    x.srcvar is x$var if x$var exists, or else
                rep(NA, nrow(x)) if y$var is not a factor, or else
                as.factor(rep(NA, nrow(x)))
and then wrap everything in a data frame.
Here’s a “naive” implementation:
merge.datasets1 <- function(a, b) {
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}
Here’s a different implementation that uses “vectorized” functions:
merge.datasets2 <- function(a, b) {
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  out <- mapply(function(x,y) unlist(list(a[[x]], b[[y]]),
                                     recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}
Now we can test:
sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])
system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>> user system elapsed
>> 0.192 0.000 0.190
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>> user system elapsed
>> 2.292 0.000 2.293
So, the naive version is about an order of magnitude (roughly 12 times) faster
than the other. How can this be? I always thought that for loops are slow, and
that one should rather use lapply and friends and steer clear of loops in R.
I would welcome any idea on how to improve my function in terms of speed.
In fact, you are not trying to replicate merge(a,b, all = TRUE) at all, as you are not trying to merge on any of the columns. Instead you are simply stacking the data, filling with NA where a column does not exist.

The reason merge(a,b, all = TRUE) will be slow is that it defaults to merging by the intersection of the names. If you convert to data.tables then the merge.data.table method is lightning fast, but with your test data it would create an exponentially growing dataset on each successive merge (not the 7500 by 5 that you want your results to be).

An easy solution is to use rbind.fill from the plyr package.

EDIT 2-11-2012

On further consideration it is really useful to note that you can pass a list of data.frames to rbind.fill, so the whole list can be combined in a single call. The majority of the time is spent in the overheads for Reduce, which rbind.fill does not require.
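
As a rough illustration of the difference between joining and stacking, here is a minimal sketch. It assumes the plyr package is installed; the small data frames a and b and the names joined, stacked.ab and stacked are made up for the example and are not from the question.

```r
library(plyr)  # assumption: plyr is installed; it provides rbind.fill()

# Two small illustrative frames that share only the column x.
a <- data.frame(x = c(1, 1), y = c("p", "q"))
b <- data.frame(x = c(1, 1, 1), z = c(TRUE, FALSE, TRUE))

# merge() joins on the shared column x, so every row of a is paired
# with every matching row of b: 2 * 3 rows.
joined <- merge(a, b, all = TRUE)
nrow(joined)

# rbind.fill() stacks the rows instead and fills the columns a frame
# lacks with NA: 2 + 3 rows.
stacked.ab <- rbind.fill(a, b)
nrow(stacked.ab)

# The test data from the question: rbind.fill() accepts the whole list,
# so neither Reduce() nor a pairwise helper is needed.
set.seed(1)  # only to make the sampled columns reproducible
sample.datasets <- lapply(1:50, function(i) iris[, sample(names(iris), 4)])
stacked <- rbind.fill(sample.datasets)
dim(stacked)  # 50 * 150 = 7500 rows; one column per distinct name seen
```

To compare timings on your own machine, wrap the last call in system.time(), as was done for merge.datasets1 and merge.datasets2 in the question.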