The Archive Base

Editorial Team
Asked: June 9, 2026

Been using SO as a resource constantly for my work. Thanks for holding together such a great community.

I’m trying to do something kinda complex, and the only way I can think to do it right now is with a pair of nested for-loops (I know that’s frowned upon in R)… I have records of three million-odd course enrollments: student UserIDs paired with CourseIDs. Each row holds a bunch of data, including start/end dates, scores and so forth. What I need to do is, for each enrollment, calculate the average score for that user across the courses she finished before the course in that enrollment started.

The code I’m using for the for-loop follows:

data$Mean.Prior.Score <- 0
for (i in as.numeric(rownames(data)) {
    sum <- 0
    count <- 0
    for (j in as.numeric(rownames(data[data$UserID == data$UserID[i],]))) {
            if (data$Course.End.Date[j] < data$Course.Start.Date[i]) {
                sum <- sum + data$Score[j]
                count <- count + 1
            }
    }
if (count != 0)
    data$Mean.Prior.Score[i] <- sum / count
}

I’m pretty sure this would work, but it runs incredibly slowly… my data frame has over three million rows, but after a good 10 minutes of chugging, the outer loop has only run through 850 of the records. That seems way slower than the time complexity would suggest, especially given that each user has only 5 or 6 courses to her name on average.

Oh, and I should mention that I converted the date strings with as.POSIXct() before running the loop, so the date comparison step shouldn’t be too terribly slow…

There’s got to be a better way to do this… any suggestions?


Edit: As per mnel’s request… finally got dput to play nicely. Had to add control = NULL. Here ’tis:

structure(list(Username = structure(1:20, .Label = c("100225", 
"100226", "100228", "1013170", "102876", "105796", "106753", 
"106755", "108568", "109038", "110150", "110200", "110350", "111873", 
"111935", "113579", "113670", "117562", "117869", "118329"), class = "factor"), 
User.ID = c(2313737L, 2314278L, 2314920L, 9708829L, 2325896L, 
2315617L, 2314644L, 2314977L, 2330148L, 2315081L, 2314145L, 
2316213L, 2317734L, 2314363L, 2361187L, 2315374L, 2314250L, 
2361507L, 2325592L, 2360182L), Course.ID = c(2106468L, 2106578L, 
2106493L, 5426406L, 2115455L, 2107320L, 2110286L, 2110101L, 
2118574L, 2106876L, 2110108L, 2110058L, 2109958L, 2108222L, 
2127976L, 2106638L, 2107020L, 2127451L, 2117022L, 2126506L
), Course = structure(c(1L, 7L, 10L, 15L, 11L, 19L, 4L, 6L, 
3L, 12L, 2L, 9L, 17L, 8L, 20L, 18L, 13L, 16L, 5L, 14L), .Label = c("ACCT212_A", 
"BIOS200_N", "BIS220_T", "BUSN115_A", "BUSN115_T", "CARD205_A", 
"CIS211_A", "CIS275_X", "CIS438_S", "ENGL112_A", "ENGL112_B", 
"ENGL227_K", "GM400_A", "GM410_A", "HUMN232_M", "HUMN432_W", 
"HUMN445_A", "MATH100_X", "MM575_A", "PSYC110_Y"), class = "factor"), 
Course.Start.Date = structure(c(1098662400, 1098662400, 1098662400, 
1309737600, 1099267200, 1098662400, 1099267200, 1099267200, 
1098662400, 1098662400, 1099267200, 1099267200, 1099267200, 
1098662400, 1104105600, 1098662400, 1098662400, 1104105600, 
1098662400, 1104105600), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
Term.ID = c(12056L, 12056L, 12056L, 66282L, 12057L, 12056L, 
12057L, 12057L, 12056L, 12056L, 12057L, 12057L, 12057L, 12056L, 
13469L, 12056L, 12056L, 13469L, 12056L, 13469L), Term.Name = structure(c(2L, 
2L, 2L, 4L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L, 
2L, 3L, 2L, 3L), .Label = c("Fall 2004", "Fall 2004 Session A", 
"Fall 2004 Session B", "Summer Session A 2011"), class = "factor"), 
Term.Start.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-10-21", 
"2004-10-28", "2004-12-27", "2011-06-26"), class = "factor"), 
Score = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 
0, 0, 0, 0, 0), First.Course.Date = structure(c(1L, 1L, 1L, 
4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 
1L, 3L), .Label = c("2004-10-25", "2004-11-01", "2004-12-27", 
"2011-07-04"), class = "factor"), First.Term.Date = structure(c(1L, 
1L, 1L, 4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 
1L, 3L, 1L, 3L), .Label = c("2004-10-21", "2004-10-28", "2004-12-27", 
"2011-06-26"), class = "factor"), First.Timer = c(TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), Course.Code = structure(c(1L, 
6L, 9L, 13L, 9L, 17L, 4L, 5L, 3L, 10L, 2L, 8L, 15L, 7L, 18L, 
16L, 11L, 14L, 4L, 12L), .Label = c("ACCT212", "BIOS200", 
"BIS220", "BUSN115", "CARD205", "CIS211", "CIS275", "CIS438", 
"ENGL112", "ENGL227", "GM400", "GM410", "HUMN232", "HUMN432", 
"HUMN445", "MATH100", "MM575", "PSYC110"), class = "factor"), 
Course.End.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-12-19", 
"2005-02-27", "2005-03-26", "2011-08-28"), class = "factor")), .Names = c("Username", 
"User.ID", "Course.ID", "Course", "Course.Start.Date", "Term.ID", 
"Term.Name", "Term.Start.Date", "Score", "First.Course.Date", 
"First.Term.Date", "First.Timer", "Course.Code", "Course.End.Date"
), row.names = c(NA, 20L), class = "data.frame")

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 9:09 am

    I found that data.table worked well.

    # Create some data.
    library(data.table)
    set.seed(1)
    n=3e6
    numCourses=5 # Average courses per student
    data=data.table(UserID=as.character(round(runif(n,1,round(n/numCourses)))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
    data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
    # setkey sorts the rows by UserID, so the grouped result below lines up with the rows.
    setkey(data,UserID)
    # First attempt, which compares every start date with every end date within a user:
    # test=function(CourseEndDate,Score,CourseStartDate) sapply(CourseStartDate, function(y) mean(Score[y>CourseEndDate]))
    # I vastly reduced the number of comparisons with a better "test" function.
    test2=function(CourseEndDate,Score,CourseStartDate) {
        o.end = order(CourseEndDate)
        # Running mean of the scores, taking courses in order of end date.
        run.avg = cumsum(Score[o.end])/seq_along(CourseEndDate)
        # For each start date, the number of courses that ended on or before it.
        idx=findInterval(CourseStartDate,CourseEndDate[o.end])
        # No prior courses means no prior mean.
        idx=ifelse(idx==0,NA,idx)
        run.avg[idx]
    }
    system.time(data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1) 
    #  For three million courses, at an average of 5 courses per student:
    #    user  system elapsed 
    #    122.06    0.22  122.45 
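
To see why test2 works, here is a minimal sketch for a single hypothetical student, in base R. The name prior_mean and the toy dates and scores are made up for illustration; the logic is meant to mirror test2 above.

```r
# Toy data for one hypothetical student (values invented for illustration).
end_date   <- as.Date(c("2000-03-01", "2000-01-15", "2000-06-30"))
start_date <- as.Date(c("2000-02-01", "2000-01-01", "2000-04-01"))
score      <- c(0.8, 0.4, 0.6)

prior_mean <- function(end_date, score, start_date) {
  o <- order(end_date)                        # course positions sorted by end date
  run_avg <- cumsum(score[o]) / seq_along(o)  # mean of the k earliest-ending courses
  k <- findInterval(start_date, end_date[o])  # courses that ended on or before each start
  k[k == 0] <- NA                             # no prior course, so no prior mean
  run_avg[k]
}

result <- prior_mean(end_date, score, start_date)
print(result)
```

The first course starts after only the second one has ended, so its prior mean is that course's score (0.4). The second course has nothing before it and gets NA. The third starts after both of the others have ended, so it gets the mean of 0.8 and 0.4.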
    

    A small test to check that it gives the same result as your code:

    set.seed(1)
    n=1e2
    data=data.table(UserID=as.character(round(runif(n,1,1000))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
    data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
    setkey(data,UserID)
    data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1
    data["246"]
    #   UserID course     Score CourseStartDate CourseEndDate MeanPriorScore
    #1:    246     54 0.4531314      2000-08-09    2000-09-20      0.9437248
    #2:    246     89 0.9437248      2000-02-19    2000-03-02             NA
    
    # A comparison with your for loop (slightly modified)
    data$MeanPriorScore.old<-NA_real_ # Set to NA instead of zero for easy comparison.
    # I think you forgot a bracket here. Also, there is no need to work with the rownames.
    for (i in seq(nrow(data))) { 
        sum <- 0
        count <- 0
        # I reduced the complexity of figuring out the vector to loop through.
        # It will result in the exact same thing if there are no rownames.
        for (j in which(data$UserID == data$UserID[i])) {
                # Note "<=" rather than "<": test2 counts a course ending on the start date as prior.
                if (data$CourseEndDate[j] <= data$CourseStartDate[i]) {
                    sum <- sum + data$Score[j]
                    count <- count + 1
                }
        }
        # I had to add "[i]" here. I think that is what you meant.
        if (count != 0) data$MeanPriorScore.old[i] <- sum / count 
    }
    identical(data$MeanPriorScore,data$MeanPriorScore.old)
    # [1] TRUE
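
The grouped assignment above relies on setkey() having sorted the rows by UserID. If you would rather not depend on row order, or do not have data.table at hand, something like the following base R sketch may work. It is not what I timed above, and prior_mean, df and the simulated data are assumptions for illustration. The last step checks the result against a brute-force calculation.

```r
# Same running-average idea as test2, written as a standalone helper (hypothetical name).
prior_mean <- function(end_date, score, start_date) {
  o <- order(end_date)
  run_avg <- cumsum(score[o]) / seq_along(o)
  k <- findInterval(start_date, end_date[o])
  k[k == 0] <- NA
  run_avg[k]
}

# Small simulated data set, deliberately left unsorted by user.
set.seed(42)
n <- 200
df <- data.frame(
  UserID = sample(1:40, n, replace = TRUE),
  Score = runif(n),
  CourseStartDate = as.Date("2000-01-01") + sample(1:365, n, replace = TRUE)
)
df$CourseEndDate <- df$CourseStartDate + sample(1:100, n, replace = TRUE)

# Group row indices by user and write each group's result back to its own rows.
df$MeanPriorScore <- NA_real_
for (rows in split(seq_len(n), df$UserID)) {
  df$MeanPriorScore[rows] <- prior_mean(df$CourseEndDate[rows],
                                        df$Score[rows],
                                        df$CourseStartDate[rows])
}

# Brute-force reference: mean score of the same user's courses that ended
# on or before this course's start date.
brute <- vapply(seq_len(n), function(i) {
  prior <- df$UserID == df$UserID[i] & df$CourseEndDate <= df$CourseStartDate[i]
  if (any(prior)) mean(df$Score[prior]) else NA_real_
}, numeric(1))

stopifnot(isTRUE(all.equal(df$MeanPriorScore, brute)))
```

Writing back through the row indices keeps each value attached to its own row, whatever order the data frame happens to be in.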
    