I’m dealing with a data set that has some obvious errors in it (e.g. a kid who’s < 1 yr old with a $50,000 credit card balance). I can’t go through it line by line, as the set has >100k lines. Is there any formal work on how to search for these types of obvious problems in data sets, or even better, any packages in R? Or should I just start making histograms?
As far as I know there is no such package. What you’re asking for seems very specialized; I think you’re really looking for anomalies or outliers. It would be cool, though, to have something that regressed each variable on the others and searched for potential extreme outliers (probably not that hard to make).
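Here is a rough sketch of that "regress everything on everything else" idea, just to show it really isn't much code. `flag_outliers` and its `cutoff` argument are made-up names for illustration, not from any package, and it assumes at least two numeric columns with ordinary (syntactic) names and roughly linear relationships between them.

```r
# Hypothetical helper: regress each numeric column on all the other numeric
# columns and report rows whose standardized residual is large in absolute value.
flag_outliers <- function(df, cutoff = 3) {
  num <- df[sapply(df, is.numeric)]
  # lm() silently drops rows with NAs; drop them up front so positions line up.
  num <- num[complete.cases(num), , drop = FALSE]
  hits <- lapply(names(num), function(v) {
    fit <- lm(reformulate(setdiff(names(num), v), response = v), data = num)
    res <- rstandard(fit)
    idx <- which(abs(res) > cutoff)
    if (length(idx) == 0) return(NULL)
    data.frame(row = rownames(num)[idx], variable = v,
               std_resid = unname(res[idx]))
  })
  do.call(rbind, hits)  # NULL if nothing was flagged
}

# Made-up demo data with one planted bad record in row 1.
set.seed(42)
demo <- data.frame(age = round(runif(500, 18, 80)))
demo$income  <- 1000 * demo$age + rnorm(500, sd = 5000)
demo$balance <- 0.1 * demo$income + rnorm(500, sd = 1000)
demo$age[1] <- 0
demo$balance[1] <- 50000

res <- flag_outliers(demo)
print(res)
```

Expect a few ordinary rows to show up too, since about 0.3% of perfectly fine observations land beyond 3 SDs by chance. Treat the output as a list of rows to eyeball, not a list of errors.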
3 thoughts:
1) A scatterplot of variables you’d expect to be connected, such as age and income. Even with 100k lines, that one (a 1 yr old making 50K) would pop up way away from all the others.
2) Run a regression and look at the diagnostic plots of the model (plot(model) in R). There’s some pretty good outlier detection there.
3) Search through the standardized residuals and look for values more than 2, or more likely 3, SDs from zero in either direction, using a which statement to index the observation numbers in the data.
Something like:
dataframe[which(abs(rstandard(model)) > 3), ]
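To make the three steps concrete, here is a small self-contained sketch on simulated data. The data frame `dat`, its columns `age` and `balance`, and the planted bad record are all invented for illustration; swap in your own data frame and variables.

```r
# Simulated stand-in for the real data (all values are made up).
set.seed(1)
n <- 1000
dat <- data.frame(age = round(runif(n, 18, 80)))
dat$balance <- 200 * dat$age + rnorm(n, sd = 1500)

# Plant one obviously bad record, like the one in the question.
dat$age[1] <- 0
dat$balance[1] <- 50000

# 1) Scatterplot: the bad record sits far away from the main cloud.
if (interactive()) plot(dat$age, dat$balance, xlab = "age", ylab = "balance")

# 2) Fit a regression; plot(model) steps through the standard lm diagnostics.
model <- lm(balance ~ age, data = dat)
if (interactive()) plot(model)

# 3) Pull the rows whose standardized residual exceeds 3 in absolute value.
flagged <- dat[which(abs(rstandard(model)) > 3), ]
print(flagged)
```

One thing to watch: if the variables have missing values, lm() drops those rows, so the positions returned by which() no longer match the rows of the original data frame. Filtering to complete cases before fitting avoids that.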