I’m dealing with the KDD 2010 data https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp In R, how can I remove

Question

Asked: May 26, 2026 (2026-05-26T17:40:01+00:00)

I’m working with the KDD 2010 data: https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
In R, how can I remove the rows whose factor level has a low total number of instances?

Here is what I’ve tried. First, I create a table of counts for the student name factor:

studenttable <- table(data$Anon.Student.Id)

This returns a table:

l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72 
      9890       7989       7665       7242       6928       6651 

Then I can get a table that tells me whether there are more than 1000 data points for each factor level:

biginstances <- studenttable>1000

Then I tried subsetting the data with this logical table:

bigdata <- subset(data, (biginstances[Anon.Student.Id]))

But the subsets I get look wrong: they still have the same number of factor levels as the full data set.
I simply want to remove the rows whose factor level isn’t well represented in the data set.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T17:40:02+00:00

This is how I managed to do it.
I sorted the table of factor levels and their associated counts:

studenttable <- sort(studenttable, decreasing=TRUE)

Now that the table is in order, we can use index ranges sensibly. So I counted the factor levels that appear more than 1000 times in the data:

sum(studenttable>1000)
230
sum(studenttable<1000)
344
344+230=574

Now we know that the first 230 factor levels in the sorted table are the ones we care about. So we can do:

idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]
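The same filter can probably be written without sorting or hard-coding the count of 230, by taking the level names directly from the logical comparison. This is only a sketch: `toy` is a made-up data frame standing in for the KDD data, and the threshold of 2 stands in for 1000.

```r
# Hypothetical toy data standing in for the KDD Cup 2010 data frame.
toy <- data.frame(
  Anon.Student.Id = factor(c("a", "a", "a", "b", "c", "c", "c", "c")),
  value = 1:8
)
threshold <- 2  # stands in for 1000 in the real data

# Count the rows per factor level.
counts <- table(toy$Anon.Student.Id)

# Names of the levels whose count exceeds the threshold.
idx <- names(counts)[counts > threshold]

# Keep only the rows that belong to those levels.
bigtoy <- toy[toy$Anon.Student.Id %in% idx, ]
nrow(bigtoy)  # 7 rows: three for "a" and four for "c"
```

Selecting by name this way does not depend on the order of the table, so the result should stay correct if the threshold changes.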

We can verify that it worked by doing:

bigstudenttable <- table(bigdata$Anon.Student.Id)

This prints the table, where all the factor levels with fewer than 1000 instances now have a count of 0.
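Those zero counts appear because subsetting rows keeps the factor's original set of levels, which is likely also why the subset in the question still showed the full number of levels. If the unused levels should disappear as well, base R's `droplevels()` can be applied after filtering. A small sketch on made-up data (`toy` and the kept levels `"a"` and `"c"` are illustrative, not from the KDD data):

```r
# Hypothetical toy data: level "b" is the under-represented one.
toy <- data.frame(
  Anon.Student.Id = factor(c("a", "a", "a", "b", "c", "c", "c", "c"))
)

# Filter rows; drop = FALSE keeps the result a data frame.
kept <- toy[toy$Anon.Student.Id %in% c("a", "c"), , drop = FALSE]

# Level "b" is still listed here, with a count of 0.
table(kept$Anon.Student.Id)

# Remove the factor levels that no longer occur in the data.
kept <- droplevels(kept)
nlevels(kept$Anon.Student.Id)  # 2
```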
