I have a huge 1000 x 100000 data frame, like the following, that I need to recode to numeric values.
# stringsAsFactors = TRUE keeps the columns as factors (the default before R 4.0)
myd <- data.frame(
  v1 = sample(c("AA", "AB", "BB", NA), 10, replace = TRUE),
  v2 = sample(c("CC", "CG", "GG", NA), 10, replace = TRUE),
  v3 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
  v4 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
  v5 = sample(c("CC", "CA", "AA", NA), 10, replace = TRUE),
  stringsAsFactors = TRUE
)
myd
v1 v2 v3 v4 v5
1 AB CC <NA> <NA> AA
2 AB CG TT TT AA
3 AA GG AT AT CA
4 <NA> <NA> <NA> AT <NA>
5 AA <NA> AA <NA> CA
6 BB <NA> TT TT CC
7 AA GG AA AT CA
8 <NA> GG <NA> AT CA
9 AA <NA> AT <NA> CC
10 AA GG TT AA CC
Each variable has potentially four unique values (three genotypes plus NA).
unique(myd$v1)
[1] AB AA <NA> BB
Levels: AA AB BB
unique(myd$v2)
[1] CC CG GG <NA>
Levels: CC CG GG
These unique values can be any combination, but each consists of two letters (NA aside). For example, the letters “A” and “B” in the first case give the combinations “AA”, “AB”, “BB”, whose numerical codes would be 1, 0, -1 respectively. Similarly, in the second case the letters “C” and “G” give “CC”, “CG”, “GG”, and the numerical codes would again be 1, 0, -1 respectively. Thus the above myd needs to be recoded to:
myd
v1 v2 v3 v4 v5
1 0 1 <NA> <NA> 1
2 0 0 -1 -1 1
3 1 -1 0 0 0
4 <NA> <NA> <NA> 0 <NA>
5 1 <NA> 1 <NA> 0
6 -1 <NA> -1 -1 -1
7 1 -1 1 0 0
8 <NA> -1 <NA> 0 0
9 1 <NA> 0 <NA> -1
10 1 -1 -1 1 -1
You can take advantage of the fact that your data are factors, which have numeric indices underneath them.
For example:
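A small sketch of what those underlying indices look like. It uses a hypothetical factor `g` rather than the randomly sampled `myd` columns, so the values are reproducible:

```r
# Hypothetical genotype factor; R sorts the levels alphabetically: AA, AB, BB
g <- factor(c("AB", "AA", NA, "BB"))

levels(g)      # the labels, in index order: "AA" "AB" "BB"
as.integer(g)  # the index of each element: 2 1 NA 3 (NA stays NA)
```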
The numeric values correspond to the levels() of the factor, so 1 == AA, 2 == AB, 3 == BB, and so on.
So you can simply convert your data to numeric and apply the necessary arithmetic to scale it how you want. Subtracting 2 and then multiplying by -1 gives your results: (1 - 2) * -1 = 1, (2 - 2) * -1 = 0, and (3 - 2) * -1 = -1.
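The conversion described above could be sketched for a whole data frame roughly as follows. The small `myd` here is a hypothetical stand-in with fixed values (not the sampled data from the question), and `recoded` is just an illustrative name:

```r
# Hypothetical stand-in for myd with fixed values, so the result is reproducible
myd <- data.frame(
  v1 = c("AB", "AA", NA, "BB"),
  v2 = c("CC", "CG", "GG", NA),
  stringsAsFactors = TRUE  # R >= 4.0 no longer converts strings to factors by default
)

# Convert every column to its factor index, then map 1, 2, 3 to 1, 0, -1
recoded <- (sapply(myd, as.integer) - 2) * -1
recoded
```

Note that `sapply()` returns a numeric matrix here, so wrap it in `as.data.frame()` if you need a data frame back. This approach also assumes every column actually contains all three genotypes, with alphabetical order matching the coding you want; a column where one genotype never occurs would have only two levels and its indices would be shifted.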