I have an online survey dataset in which some participants made multiple complete attempts, and I need to selectively remove several of those cases by row number. The data is stored as a data.frame. I realize I could do this manually, but I want to keep it as a script so that I can reuse it later if need be, or so that someone else can duplicate what I’ve done quickly and efficiently.
What I have tried: I have searched in multiple locations, but my question seems too simple to turn up. I have looked at removing rows based on incomplete cases (‘complete.cases’ and ‘na.omit’), but this is not specifically what I want, as I am trying to remove a row based on a value in a specific vector (column) within the data.frame.
The data:
user_id  var1    var2  var3
1        NA      13    bob
3        time    37    fred
4        second  NA    lisa
5        second  28    lisa
So, in the above data.frame I have multiple attempts by user lisa. I want to keep her last attempt because it is more complete (no NA in var2), but I need to remove the row based on user_id rather than var3.
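For reference, dropping a row by a known user_id can be sketched with plain logical indexing. This is only a minimal example: the data.frame name `df`, the column types, and the choice of user_id 4 (lisa's incomplete attempt) are assumptions based on the table above.

```r
# Example data as shown above (column types are an assumption).
df <- data.frame(
  user_id = c(1, 3, 4, 5),
  var1    = c(NA, "time", "second", "second"),
  var2    = c(13, 37, NA, 28),
  var3    = c("bob", "fred", "lisa", "lisa"),
  stringsAsFactors = FALSE
)

# user_ids to drop; 4 is lisa's less complete attempt.
drop_ids <- c(4)

# Keep every row whose user_id is NOT in drop_ids.
cleaned <- df[!(df$user_id %in% drop_ids), ]
print(cleaned)
```

Using `%in%` rather than `!=` means the same line still works when `drop_ids` holds several ids.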
Starting with:
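Something like the following should reproduce the example data from the question. The name `df` and the column types (numeric ids, character strings rather than factors) are my assumptions.

```r
# Rebuild the example data.frame from the question.
df <- data.frame(
  user_id = c(1, 3, 4, 5),
  var1    = c(NA, "time", "second", "second"),
  var2    = c(13, 37, NA, 28),
  var3    = c("bob", "fred", "lisa", "lisa"),
  stringsAsFactors = FALSE  # keep strings as character in older R versions
)
print(df)
```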
First compute the completeness score by counting the non-NA values in var1 to var3:
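A sketch of the score calculation, assuming the data is in a data.frame called `df` and the new column is called `score` (both names are mine): `is.na()` gives a logical matrix, and `rowSums()` of its negation counts the non-NA cells per row.

```r
df <- data.frame(
  user_id = c(1, 3, 4, 5),
  var1    = c(NA, "time", "second", "second"),
  var2    = c(13, 37, NA, 28),
  var3    = c("bob", "fred", "lisa", "lisa"),
  stringsAsFactors = FALSE
)

# Count the non-NA entries across var1..var3 for each row.
df$score <- rowSums(!is.na(df[, c("var1", "var2", "var3")]))
print(df)
```

With this data, lisa's two attempts get different scores, so the more complete one can be picked out.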
Then find the row with max(score) in each group. There’s probably an easier way to do this:
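One way this could look, assuming var3 identifies the person and using `split()` plus `lapply()` for the grouping. This is a sketch, not necessarily the neatest approach, and the names `pick` and `result` are illustrative.

```r
df <- data.frame(
  user_id = c(1, 3, 4, 5),
  var1    = c(NA, "time", "second", "second"),
  var2    = c(13, 37, NA, 28),
  var3    = c("bob", "fred", "lisa", "lisa"),
  stringsAsFactors = FALSE
)
df$score <- rowSums(!is.na(df[, c("var1", "var2", "var3")]))

# For each person, collect the user_id(s) of the row(s) with the top score.
pick <- unlist(lapply(split(df, df$var3), function(g) {
  g$user_id[g$score == max(g$score)]
}))

# Keep only the picked rows.
result <- df[df$user_id %in% pick, ]
print(result)
```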
If someone has two rows with the same score, they’ll appear twice. For example, suppose bob had a second attempt with the same score:
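To illustrate a tie, here is a sketch that appends a made-up second attempt for bob. The user_id 6 and the var2 value 99 are hypothetical, chosen only so that the new row has the same score as his first one.

```r
df <- data.frame(
  user_id = c(1, 3, 4, 5),
  var1    = c(NA, "time", "second", "second"),
  var2    = c(13, 37, NA, 28),
  var3    = c("bob", "fred", "lisa", "lisa"),
  stringsAsFactors = FALSE
)

# Hypothetical extra attempt by bob, also missing var1.
extra <- data.frame(user_id = 6, var1 = NA_character_, var2 = 99,
                    var3 = "bob", stringsAsFactors = FALSE)
df <- rbind(df, extra)

# Recompute the completeness score for every row.
df$score <- rowSums(!is.na(df[, c("var1", "var2", "var3")]))
print(df)
```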
Now if I recompute pick I get bob twice:
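A sketch of what happens with a tie, using a hypothetical second row for bob (user_id 6, var2 99, both made up) and the same equality-based selection: both of bob's rows match `max(score)`, so both ids come back.

```r
# Example data plus a hypothetical tied attempt for bob (user_id 6).
df <- data.frame(
  user_id = c(1, 3, 4, 5, 6),
  var1    = c(NA, "time", "second", "second", NA),
  var2    = c(13, 37, NA, 28, 99),
  var3    = c("bob", "fred", "lisa", "lisa", "bob"),
  stringsAsFactors = FALSE
)
df$score <- rowSums(!is.na(df[, c("var1", "var2", "var3")]))

# Every row that equals the group maximum is returned, ties included.
pick <- unlist(lapply(split(df, df$var3), function(g) {
  g$user_id[g$score == max(g$score)]
}))
print(pick)  # bob contributes two ids: 1 and 6
```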
Which can be fixed by just returning the first match in the pick calculation:
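One way to return only the first match is `which.max()`, which gives the index of the first maximum. That is my choice for this sketch (indexing the matches with `[1]` would do the same job), and the tied row for bob (user_id 6, var2 99) is hypothetical.

```r
df <- data.frame(
  user_id = c(1, 3, 4, 5, 6),
  var1    = c(NA, "time", "second", "second", NA),
  var2    = c(13, 37, NA, 28, 99),
  var3    = c("bob", "fred", "lisa", "lisa", "bob"),
  stringsAsFactors = FALSE
)
df$score <- rowSums(!is.na(df[, c("var1", "var2", "var3")]))

# which.max() returns the position of the FIRST maximum, so ties collapse
# to the earliest row in each group.
pick <- unlist(lapply(split(df, df$var3), function(g) {
  g$user_id[which.max(g$score)]
}))

result <- df[df$user_id %in% pick, ]
print(result)
```

Note that "first" here means first in row order, so sort the data beforehand if a different attempt should win a tie.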
You didn’t say what you wanted doing with duplicates…
Someone will probably have a one-liner posted in a tick…