I'm having a lot of trouble dealing with the level names of a factor in a data frame.
I have a big data frame in which one of the columns is a factor with a LOT of levels.
The problem is that some of these values are duplicated, and the next step in my analysis does not accept duplicated data. So I need to rename the duplicated entries so I can move on to the next step.
Let me give a small example.
Say we have this simple data frame with one column:
> df
col_foo
1 bar1
2 bar2
3 bar3
4 bar2
5 bar4
6 bar5
7 bar3
If we look at the column, we see that it is a factor with 5 distinct levels.
> df$col_foo
[1] bar1 bar2 bar3 bar2 bar4 bar5 bar3
Levels: bar1 bar2 bar3 bar4 bar5
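To reproduce the example, a data frame like this could be built as follows. This is only a sketch: the real data comes from elsewhere, and the explicit factor() call is an assumption to make sure the column is a factor regardless of the R version in use.

```r
# Build the one-column example; factor() is called explicitly because
# data.frame() no longer converts strings to factors by default in R >= 4.0
df <- data.frame(
  col_foo = factor(c("bar1", "bar2", "bar3", "bar2", "bar4", "bar5", "bar3"))
)

nlevels(df$col_foo)     # number of distinct levels
duplicated(df$col_foo)  # TRUE marks the second and later occurrences of a value
```

Here duplicated() flags rows 4 and 7, which are the two entries that need a new name.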
OK, here is the problem. Notice that the values bar2 and bar3 each appear twice. What I want to know is how I can add a new level name, something like bar2_X, and substitute it only for the duplicated occurrence. The data frame should then become this:
> df
col_foo
1 bar1
2 bar2
3 bar3
4 bar2_X
5 bar4
6 bar5
7 bar3_X
Is that possible? I cannot change the class of the column; it should still be a factor, so solutions that change it will not solve my problem unless it is possible to coerce back to a factor afterwards.
Thanks
If you want all the entries to be unique then a factor does not gain you much over just using a character variable.
Probably the simplest way to do what you want is to coerce to a character vector, use the duplicated function to find the duplicates and paste something onto the end of them, then, if you want, use factor to coerce it back to a factor. Possibly something like:
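A minimal sketch of that approach, assuming the example data frame from the question. The "_X" suffix is taken from the question; the variable names x and dup are only illustrative.

```r
# Rebuild the example column from the question as a factor
df <- data.frame(
  col_foo = factor(c("bar1", "bar2", "bar3", "bar2", "bar4", "bar5", "bar3"))
)

x <- as.character(df$col_foo)    # work on the labels, not the integer codes
dup <- duplicated(x)             # TRUE for second and later occurrences
x[dup] <- paste0(x[dup], "_X")   # append the suffix only to the duplicates
df$col_foo <- factor(x)          # coerce back to a factor

df$col_foo
```

Note that if a value occurs three or more times, every later occurrence gets the same "_X" suffix and the result still contains duplicates. In that case something like make.unique(x, sep = "_") may be a better fit, since it appends a running number instead of a fixed suffix.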