I have a data frame whose first two columns are a sample ID and a well position, like so:
>df[1:12,1:10]
S W V3 V4
SID1 A01 <NA> <NA>
SID2 A02 <NA> <NA>
SID3 A03 <NA> <NA>
SID4 A01 <NA> <NA>
SID5 A02 <NA> <NA>
SID5 A03 <NA> <NA>
The combination of the S and W columns is unique and must remain so: some samples have repeated measures, but for downstream analysis reasons (outside R) they cannot be placed on the same row, as would be usual.
I wish to insert data into the data frame based on the unique combination of these two columns.
The data I am trying to insert is from another data frame and looks like this:
>results[1:12, 1:4]
SampleID Value Assay Well
SID1 0 V3 A01
SID1 0 V4 A01
SID2 1 V3 A02
SID2 2 V4 A02
SID3 0 V3 A03
SID3 1 V4 A03
SID4 0 V3 A01
SID4 0 V4 A01
SID5 1 V3 A02
SID5 2 V4 A02
SID6 0 V3 A03
SID6 1 V4 A03
Currently I am looping through the columns (V3 and V4 here; there are about 1000 columns in the real data set) and inserting the data for each column one at a time, based on the unique combination of sample ID, well position and assay. This is slow. I want to vectorise this by inserting all the values for V3 at once, based on sample ID and well.
I tried
for (i in levels(results$Assay))
{
  df$V3[(df$V1 %in% results$SampleID) & (df$V2 %in% results$Well)] =
    results$Value[results$Assay == i]
}
This doesn’t work for me. I imagine because of something stupid on my part!
Any ideas?
EDIT:
Actually, Ben’s solution only almost worked. Everything goes fine at first, but the assays are spread out over n files and the samples over y files, so when merge tries to join the two data frames on an assay that has already been merged into df, it adds a new column and appends a “.1” to the name.
Exactly what you’d expect merge to do I suppose. My fault for not explaining that my data is coming from separate files.
To illustrate:
I have 16 files. There are 1536 samples spread out over 4 files, 384 each. There are 160 separate assays, spread out over 4 assay bundles. To run every assay for every sample I end up with 16 files.
So if I can get merge to not add a new column if the column for the current assay is already there, that would be perfect.
All suggestions are welcome,
and sorry for being crap at explaining my data!
Cheers
Davy
Let’s suppose you have the file names in a vector datafiles such that files 1-4 are the data for all assays for samples 1-384, files 5-8 for all assays for samples 385-768, and so on, and that you want to end up with a data frame that is 1536 rows by 162 columns.
Split into four chunks:
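One way to do the split, as a minimal sketch. The file names are hypothetical placeholders, and it assumes the vector is ordered so that every run of four consecutive files covers the same 384 samples:

```r
# Hypothetical file names; substitute the real vector of 16 paths.
datafiles <- sprintf("plate%02d.csv", 1:16)

# Grouping index 1,1,1,1,2,2,2,2,... puts four consecutive files in each chunk.
chunk_id <- rep(1:4, each = 4)
file_chunks <- split(datafiles, chunk_id)

# A list of four character vectors, four file names each.
str(file_chunks)
```

If the files are not already in sample order, the grouping index would have to be built from the file names instead.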
A function to take a list of n data sets, each containing m assays from k individuals (i.e. each one is k*m rows by 4 columns: SampleID, Well, Assay, Value) and combine them into a single data set that is k rows by n*m+2 columns long:
Now apply this to each of the chunks:
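A sketch of what such a function and its application to the chunks could look like, using base R's reshape(). The name combine_assays, the toy data sets, and the values for V5 and V6 are made up for illustration. With real files you would first read each chunk into a list of data frames, for example with read.csv, which assumes CSV input:

```r
# Stack a list of long-format data frames (SampleID, Well, Assay, Value)
# and spread the assays into columns: one row per SampleID/Well pair.
combine_assays <- function(dlist) {
  long <- do.call(rbind, dlist)
  wide <- reshape(long,
                  idvar     = c("SampleID", "Well"),
                  timevar   = "Assay",
                  v.names   = "Value",
                  direction = "wide")
  # reshape() names the new columns "Value.V3", "Value.V4", ...; drop the prefix.
  names(wide) <- sub("^Value\\.", "", names(wide))
  rownames(wide) <- NULL
  wide
}

# Toy data (illustrative only): n = 2 data sets, m = 2 assays each, k = 3 samples.
make_set <- function(assays, values) {
  data.frame(SampleID = rep(c("SID1", "SID2", "SID3"), each = 2),
             Well     = rep(c("A01", "A02", "A03"), each = 2),
             Assay    = rep(assays, times = 3),
             Value    = values,
             stringsAsFactors = FALSE)
}
set1 <- make_set(c("V3", "V4"), c(0, 0, 1, 2, 0, 1))
set2 <- make_set(c("V5", "V6"), c(1, 1, 0, 2, 2, 0))

# One list element per chunk; each element is the list of data sets in that chunk.
chunk_data <- list(chunk1 = list(set1, set2))
wide_chunks <- lapply(chunk_data, combine_assays)
wide_chunks$chunk1  # k = 3 rows, n*m + 2 = 6 columns
```

Because each assay gets its column exactly once during the reshape, this avoids the duplicated “.1” columns that repeated merge calls produce.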
Now combine the chunks:
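Stacking the chunks might look like the following sketch. The two small data frames stand in for the per-chunk wide results and reuse values from the question:

```r
# Stand-ins for the wide data frame produced for each chunk.
chunk1 <- data.frame(SampleID = c("SID1", "SID2"), Well = c("A01", "A02"),
                     V3 = c(0, 1), V4 = c(0, 2))
chunk2 <- data.frame(SampleID = c("SID3", "SID4"), Well = c("A03", "A01"),
                     V3 = c(0, 0), V4 = c(1, 0))
wide_chunks <- list(chunk1, chunk2)

# rbind() on data frames matches columns by name, so every chunk
# must carry the same set of assay columns.
all_data <- do.call(rbind, wide_chunks)
dim(all_data)
```

With the real data this step should give the 1536 by 162 data frame, provided all four chunks contain the same 160 assay columns.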
I’m not sure this will work, but it might. You should take the pieces apart and examine what they do separately if it doesn’t work on the first try — I may have screwed up somewhere.