I have a for loop that is awfully slow and doesn't work properly. It looks up a barcode in one data.frame and then searches for that barcode in another data.frame. A barcode can occur multiple times in the second data.frame. For each barcode, a counter should count how many times it occurs there and write that number to the first data.frame.
My attempt:
for(i in 1:length(tcgadataUniek$Tumor_Sample_Barcode)){
  for(j in 1:length(hprdDataSorted$Samples.Int1)){
    count<-0
    if(i==j){
      count<-count+1
    } else {
      count<-count+0
    }
    hprdDataSorted$Samples.Int2<-count[j]
  }
}
The first data.frame looks as follows (CSV):
HUGO.Int1,HUGO.Int2,barcode.Int1
A1CF,APOBEC1,TCGA-B6-A0RS-01A-11D-A099-09
A1CF,TNPO2,TCGA-B6-A0RS-01A-11D-A099-09
A1CF,SYNCRIP,TCGA-B6-A0RS-01A-11D-A099-09
A1CF,KHSRP,TCGA-B6-A0RS-01A-11D-A099-09
A2M,SHBG,TCGA-D8-A1JK-01A-11D-A13L-09
A2M,C11orf58,TCGA-D8-A1JK-01A-11D-A13L-09
A2M,ATF7IP,TCGA-D8-A1JK-01A-11D-A13L-09
AAMP,TH1L,TCGA-A8-A08S-01A-11W-A050-09
AARS,EEF1B2,TCGA-AO-A0JC-01A-11W-A071-09
The second data.frame, which holds the duplicated barcodes (CSV):
Sample_Barcode
TCGA-A8-A08G-01A-11W-A019-09
TCGA-AO-A03O-01A-11W-A019-09
TCGA-AO-A03O-01A-11W-A019-09
TCGA-B6-A0RS-01A-11D-A099-09
TCGA-BH-A0HP-01A-12D-A099-09
TCGA-BH-A0HP-01A-12D-A099-09
TCGA-BH-A18H-01A-11D-A12B-09
TCGA-BH-A18H-01A-11D-A12B-09
TCGA-BH-A18J-01A-11D-A12B-09
TCGA-D8-A1JK-01A-11D-A13L-09
TCGA-E2-A1BC-01A-11D-A14G-09
TCGA-E2-A1BC-01A-11D-A14G-09
TCGA-E9-A1NH-01A-11D-A14G-09
TCGA-E9-A22B-01A-11D-A159-09
If a barcode from barcode.Int1 (data.frame 1) occurs 3 times in Sample_Barcode, the script should add a 3 next to that barcode.Int1. For example:
HUGO.Int1,HUGO.Int2,barcode.Int1,number_of_times
A1CF,APOBEC1,TCGA-B6-A0RS-01A-11D-A099-09,5
Paul’s comment is very appropriate; it will speed up the merge step significantly. I would use table to get the counts of the unique barcodes in your second data.frame and merge them onto your first, either in base R or in a data.table version.
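A minimal sketch of the table-and-merge idea, first in base R and then with data.table's merge method. The object names df1, df2, counts, res, dt1, dtCounts and resDT are placeholders of mine, the data is just a handful of rows copied from the question, and the data.table half assumes that package is installed. Barcodes with no match are set to 0 here, which is my reading of what the counter should do.

```r
# Stand-in data: a few rows taken from the question's samples
df1 <- data.frame(
  HUGO.Int1    = c("A1CF", "A1CF", "A2M", "AAMP"),
  HUGO.Int2    = c("APOBEC1", "TNPO2", "SHBG", "TH1L"),
  barcode.Int1 = c("TCGA-B6-A0RS-01A-11D-A099-09",
                   "TCGA-B6-A0RS-01A-11D-A099-09",
                   "TCGA-D8-A1JK-01A-11D-A13L-09",
                   "TCGA-A8-A08S-01A-11W-A050-09"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Sample_Barcode = c("TCGA-AO-A03O-01A-11W-A019-09",
                     "TCGA-AO-A03O-01A-11W-A019-09",
                     "TCGA-B6-A0RS-01A-11D-A099-09",
                     "TCGA-D8-A1JK-01A-11D-A13L-09"),
  stringsAsFactors = FALSE
)

# Base R: count every barcode once with table() ...
counts <- as.data.frame(table(Sample_Barcode = df2$Sample_Barcode),
                        responseName = "number_of_times",
                        stringsAsFactors = FALSE)

# ... then merge the counts onto the first data.frame.
# all.x = TRUE keeps rows of df1 whose barcode never occurs in df2.
res <- merge(df1, counts,
             by.x = "barcode.Int1", by.y = "Sample_Barcode",
             all.x = TRUE)
res$number_of_times[is.na(res$number_of_times)] <- 0L

# data.table: the same merge, dispatched to data.table's merge method
library(data.table)
dt1      <- as.data.table(df1)
dtCounts <- as.data.table(counts)
resDT <- merge(dt1, dtCounts,
               by.x = "barcode.Int1", by.y = "Sample_Barcode",
               all.x = TRUE)
resDT[is.na(number_of_times), number_of_times := 0L]
```

Counting once with table() and joining afterwards avoids comparing every row of one data.frame with every row of the other, which is what makes the nested loop slow.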
In “pure” data.table:
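One way this could look, as a sketch rather than a definitive version: count with .N grouped by barcode, then add the count to the first table with an update join. The names dt1, dt2 and counts are placeholders, the rows are copied from the question, and the zero fill for unmatched barcodes is my assumption.

```r
library(data.table)

# Stand-in data: a few rows taken from the question's samples
dt1 <- data.table(
  HUGO.Int1    = c("A1CF", "A1CF", "A2M", "AAMP"),
  HUGO.Int2    = c("APOBEC1", "TNPO2", "SHBG", "TH1L"),
  barcode.Int1 = c("TCGA-B6-A0RS-01A-11D-A099-09",
                   "TCGA-B6-A0RS-01A-11D-A099-09",
                   "TCGA-D8-A1JK-01A-11D-A13L-09",
                   "TCGA-A8-A08S-01A-11W-A050-09")
)
dt2 <- data.table(
  Sample_Barcode = c("TCGA-AO-A03O-01A-11W-A019-09",
                     "TCGA-AO-A03O-01A-11W-A019-09",
                     "TCGA-B6-A0RS-01A-11D-A099-09",
                     "TCGA-D8-A1JK-01A-11D-A13L-09")
)

# Number of rows per barcode in the second table
counts <- dt2[, .(number_of_times = .N), by = Sample_Barcode]

# Update join: look up each barcode.Int1 in counts and add the
# matching count to dt1 by reference (no copy of dt1 is made)
dt1[counts, on = c(barcode.Int1 = "Sample_Barcode"),
    number_of_times := i.number_of_times]

# Barcodes that never occur in dt2 are NA after the join; set them to 0
dt1[is.na(number_of_times), number_of_times := 0L]
```

Because the column is added by reference, dt1 keeps its original row order, which the merge-based variants do not guarantee.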