I have an operation I’d like to run for each row of a data frame, changing one column. I’m an apply/ddply/sqldf man, but I’ll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to changes depends on information that changes by row; depending on information in one cell, I should make a change to only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I’ve stripped my problem down to the very simple test case below.
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time(
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
)
This takes 70 seconds with ten columns, and 360 when ncol=50. That’s crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?
I already tried initializing the nested term (t.df[i,1]%%10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but its not the solution.
My current best idea came while preparing this test case. For me, only 10 of the columns are relevant (and 75-11 columns are irrelevant). Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes irrelevant columns. That will get me down to just over a minute. But is “for loop with nested indices” even the best way to think about my problem?
Using
rowandcolseems less complicated to me:I think Tommy’s is still faster, but using
rowandcolmight be easier to understand.