I’m working with the KDD Cup 2010 data: https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
In R, how can I remove rows whose factor level has a low total number of instances?
I’ve tried the following.
First, I create a table for the student name factor:
studenttable <- table(data$Anon.Student.Id)
which returns a table of counts per student:
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
9890 7989 7665 7242 6928 6651
Then I can get a table that tells me whether there are more than 1000 data points for a given factor level:
biginstances <- studenttable>1000
Then I tried making a subset of the data based on this query:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
But I get weird subsets that still have the same number of factor levels as the full set.
I’m simply interested in removing the rows whose factor level isn’t well represented in the dataset.
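One likely explanation, sketched under assumptions: subsetting a data frame keeps every level of a factor column, even levels that no longer have any rows, so the number of levels does not shrink on its own. The toy data frame and the threshold of 2 below are hypothetical stand-ins for the real KDD data and the threshold of 1000; `droplevels()` is base R and removes the unused levels.

```r
# Hypothetical toy data standing in for the KDD data frame
data <- data.frame(
  Anon.Student.Id = factor(c("a", "a", "a", "b", "c", "c")),
  x = 1:6
)

# Count rows per student, then flag students with more than 2 rows
# (2 is a stand-in for the threshold of 1000)
studenttable <- table(data$Anon.Student.Id)
biginstances <- studenttable > 2

# Look the flags up by student id (as character): one TRUE/FALSE per row
keep <- as.vector(biginstances[as.character(data$Anon.Student.Id)])
bigdata <- data[keep, ]

nlevels(bigdata$Anon.Student.Id)   # still 3: unused levels are kept

# Drop the levels that no longer occur in the subset
bigdata <- droplevels(bigdata)
nlevels(bigdata$Anon.Student.Id)   # now 1
```

If the zero-count levels are harmless for the later analysis, the `droplevels()` step can simply be skipped.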
This is how I managed to do it.
I sorted the table of factor levels and their associated counts.
Now that it’s in order, we can use column ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data.
Now we know the first 230 factor levels are the ones we care about, so we can subset the data to the rows that belong to those levels.
We can verify it worked by printing the table of counts for the subset and checking that all the factor levels with fewer than 1000 instances are now 0.
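A minimal sketch of these steps, not necessarily the code originally used. The toy data frame, the threshold of 2 (standing in for 1000), and the names `sortedtable`, `nbig`, and `biglevels` are all assumptions made for illustration.

```r
# Hypothetical toy data; the threshold of 2 stands in for 1000
data <- data.frame(
  Anon.Student.Id = factor(c("a", "a", "a", "b", "c", "c", "c", "c")),
  x = 1:8
)
threshold <- 2

# 1. Sort the per-student counts from largest to smallest
sortedtable <- sort(table(data$Anon.Student.Id), decreasing = TRUE)

# 2. Count how many levels exceed the threshold (230 in the post)
nbig <- sum(sortedtable > threshold)

# 3. Keep only rows whose student is among the first nbig levels
biglevels <- names(sortedtable)[seq_len(nbig)]
bigdata <- data[data$Anon.Student.Id %in% biglevels, ]

# 4. Verify: levels at or below the threshold now show a count of 0
table(bigdata$Anon.Student.Id)
```

Sorting is not strictly needed for the filtering itself; `names(studenttable)[studenttable > threshold]` would give the same set of levels, but the sorted table makes the cut-off easy to inspect by eye.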