I am trying to build a database in R from multiple csvs. There are

Question

Asked: June 16, 2026 (2026-06-16T02:10:20+00:00)

I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):

d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)

I have been going around in circles (writing loops), trying to use merge and reshape(melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic but I can’t figure out a good way to do it. Thanks in advance.

To be clear, I am aiming for a final database like this:
  common species id
1      a     A.a  1
2      b     B.b  2
3      c     C.c  3
4      d     D.d  4
5      e     E.e  5

1 Answer

Editorial Team · Answer 1 · 2026-06-16T02:10:21+00:00

I recently had a similar situation. The code below goes through each variable in turn, collapses the rows that share a value of that variable into the most complete record it can build, and adds those records back into the dataset. Once every variable has been processed, running the collapse one last time on the first variable gives you the result.

#combine all into one dataframe (smartbind fills missing columns with NA)
library(gtools)
library(plyr)
d <- smartbind(d1,d2,d3)

#function to get the first non NA result
#indexing [1] on an empty vector gives an NA of the same type
getfirstnonna <- function(x){
  x[!is.na(x)][1]
}

#function to get max info based on one variable
runiteration <- function(dataset,variable){
  #lapply works column by column and keeps column types
  #(apply would first coerce the data frame to a character matrix)
  e <- ddply(.data=dataset,.variables=variable,.fun=function(x){
    as.data.frame(lapply(x,getfirstnonna),stringsAsFactors=FALSE)
  })
  #returns the above without the NA group
  return(e[which(!is.na(e[ ,variable])), ])
}

#run through all variables, adding the collapsed records back each time
for(i in seq_along(names(d))){
  d <- rbind(d,runiteration(d,names(d)[i]))
}
#repeat first variable since all possible info should be available in dataset
d <- runiteration(d,names(d)[1])
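
For anyone who wants to see the per-variable collapse step in isolation, here is a minimal base R sketch of the same idea, assuming no plyr or gtools. It swaps ddply for split and lapply, so it is not the exact code above; the names collapse_by and first_non_na and the toy data are made up for illustration.

```r
# Toy stacked data (hypothetical values), one row per source record
d <- data.frame(
  common  = c("a", "a", NA),
  species = c(NA, "A.a", "A.a"),
  id      = c(1L, NA, 1L),
  stringsAsFactors = FALSE
)

# First non-NA value; [1] on an empty vector gives an NA of the same type
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse all rows sharing a non-NA value of `key` into one record
collapse_by <- function(data, key) {
  keep <- data[!is.na(data[[key]]), , drop = FALSE]
  if (nrow(keep) == 0) return(keep)
  groups <- split(keep, keep[[key]])
  rows <- lapply(groups, function(g) {
    as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

res <- collapse_by(d, "common")
print(res)
```

Here the two rows with common "a" are merged into a single record that picks up the species from one row and the id from the other; rows whose key is NA are dropped, as in runiteration.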

If id, species, etc. differ between datasets, this returns whichever non-NA value comes first. In that case, changing the row order in d or the order of the variables could affect the result. Changing the getfirstnonna function alters this behavior (using tail would pick the last value, or it could even return all possibilities). You could also order the dataset from the most complete records to the least.
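
A rough sketch of those two tweaks, in base R only. The toy data and the names getlastnonna and getallnonna are hypothetical, and the "|" separator is just one arbitrary way to show all candidates.

```r
# Hypothetical stacked data with a conflict in species (A.a vs A.x)
d <- data.frame(
  common  = c("a", NA, "a"),
  species = c(NA, "A.a", "A.x"),
  id      = c(NA, 1L, 1L),
  stringsAsFactors = FALSE
)

# Most complete records first: sort rows by how many NAs they contain
d_sorted <- d[order(rowSums(is.na(d))), ]

# Pick the last non-NA value instead of the first
getlastnonna <- function(x) {
  vals <- x[!is.na(x)]
  if (length(vals) == 0) NA else vals[length(vals)]
}

# Keep every distinct non-NA value, pasted into one string
getallnonna <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) NA else paste(vals, collapse = "|")
}

print(d_sorted)
print(getlastnonna(d$species))
print(getallnonna(d$species))
```

Sorting before running the iterations means a "first non-NA" picker prefers values from the fullest records, while the "all values" picker makes conflicts visible instead of silently choosing one.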
