Editorial Team
Asked: June 13, 2026

Let’s say a and b are two data frames. The goal is to write a function
f(a,b) that produces a merged data frame in the same way that
merge(a,b,all=TRUE) would, that is, filling variables missing from a or b with NAs. (The problem is that merge() appears to be very slow.)

This can be done as follows (pseudo-code):

for each variable `var` found in either `a` or `b`, do:
    unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)

where, for x being one of a, b and y being the other:
x.srcvar is x$var if x$var exists, or else
            rep(NA, nrow(x)) if y$var is not a factor, or else
            as.factor(rep(NA, nrow(x)))

and then wrap everything in a data frame.

Here’s a “naive” implementation:

merge.datasets1 <- function(a, b) {
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}

Here’s a different implementation that uses “vectorized” functions:

merge.datasets2 <- function(a, b) {
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  out <- mapply(function(x,y) unlist(list(a[[x]], b[[y]]),
                                     recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}

Now we can test:

sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])

system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>>   user  system elapsed 
>>  0.192   0.000   0.190 
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>>   user  system elapsed 
>>  2.292   0.000   2.293 

So, the naive version is about an order of magnitude faster than the other
(roughly 0.19 s versus 2.29 s). How can this be? I always thought that for
loops are slow, and that one should rather use lapply and friends and steer
clear of loops in R. I would welcome any ideas on how to improve the speed of my function.

1 Answer

    Editorial Team
    Added an answer on June 13, 2026 at 2:34 pm

    In fact, you are not trying to replicate merge(a,b, all = TRUE) at all, as you are not merging on any of the columns. Instead, you are simply stacking the data, filling with NA where a column does not exist.

    # note that this is not what you want
    dim(merge(sample.datasets[[1]], sample.datasets[[2]], all = TRUE))
    [1] 314   5
    

    The reason merge(a,b, all = TRUE) is slow is that it defaults to merging by the intersection of the column names. If you convert to data.tables, the merge.data.table method is lightning fast, but with your test data it would create an exponentially growing dataset on each successive merge (not the 7500 by 5 result you want).
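
To make the difference between merging and stacking concrete, here is a minimal sketch on two tiny data frames. The data frames and the names `a_padded`, `b_padded`, `merged` and `stacked` are made up for illustration; they are not from the question.

```r
# Two small data frames that share the column `k`; every row has k == 1.
a <- data.frame(k = c(1, 1, 1), x = c("a1", "a2", "a3"), stringsAsFactors = FALSE)
b <- data.frame(k = c(1, 1, 1), y = c("b1", "b2", "b3"), stringsAsFactors = FALSE)

# merge() joins on the shared column `k`, so each row of `a` is paired with
# every matching row of `b`: 3 x 3 rows.
merged <- merge(a, b, all = TRUE)
nrow(merged)   # 9

# Stacking keeps one output row per input row and pads absent columns with NA.
a_padded <- a
a_padded$y <- NA_character_
b_padded <- b
b_padded$x <- NA_character_
stacked <- rbind(a_padded, b_padded[names(a_padded)])
nrow(stacked)  # 6
```

With repeated merges on data that shares many duplicated key values, the row count multiplies at each step, which is why the merged result grows so quickly.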

    An easy solution is to use rbind.fill from the plyr package.

    library(plyr)
    system.time({.x <- Reduce(rbind.fill, sample.datasets)})
    ## user  system elapsed 
    ## 0.16    0.00    0.15 
    # which is almost identical to
    system.time(.old <- Reduce(merge.datasets1, sample.datasets))
    ##   user  system elapsed 
    ##   0.14    0.00    0.14 
    

    EDIT 2-11-2012

    On further consideration, it is really useful to note that you can pass a list of data.frames to rbind.fill directly:

     system.time(super_fast <- rbind.fill(sample.datasets))
     ##  user  system elapsed 
     ##  0.02    0.00    0.02 
    
    identical(super_fast, .old)
    [1] TRUE
    

    The majority of the time is spent in the overhead of Reduce, which rbind.fill does not need when it is given the whole list.
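
The one-pass idea can also be sketched in base R, without plyr. This is a rough illustration only and not how rbind.fill is implemented: `stack_fill` is a hypothetical helper, it assumes every column is a plain atomic vector or a factor, and it rebuilds factors from their character values, so levels end up sorted alphabetically and other attributes (dates, for example) are not preserved.

```r
# Hypothetical helper: stack a list of data frames in a single pass,
# padding columns that a data frame lacks with NA.
stack_fill <- function(dfs) {
  all_vars <- unique(unlist(lapply(dfs, names)))
  n_rows <- vapply(dfs, nrow, integer(1))
  columns <- lapply(all_vars, function(v) {
    # One piece per data frame: the column itself, or NAs if it is absent.
    pieces <- lapply(seq_along(dfs), function(i) {
      piece <- dfs[[i]][[v]]
      if (is.null(piece)) rep(NA, n_rows[i]) else piece
    })
    if (any(vapply(pieces, is.factor, logical(1)))) {
      # Go through character so that differing level sets are combined.
      factor(unlist(lapply(pieces, as.character)))
    } else {
      unlist(pieces)
    }
  })
  names(columns) <- all_vars
  data.frame(columns, stringsAsFactors = FALSE)
}

# Usage with test data built the same way as in the question.
sample.datasets <- lapply(1:50, function(i) iris[, sample(names(iris), 4)])
stacked <- stack_fill(sample.datasets)
nrow(stacked)  # 7500
```

Each output column is assembled once from all inputs, instead of growing and copying an intermediate data frame at every step as Reduce does.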
