I am not sure the title is clear enough.
I have a dataframe (see below) which contains values across 5 columns. What I would like to do is to "split" this dataframe into three classes where the rows can be assigned into a "High", "Medium", "Low" state.
What I mean is:
High: the values are "high" in at least 3 columns
Medium: the values are "medium" in at least 3 columns
Low: the values are "low" (or NA) in at least 3 columns
I guess it involves two things: defining the value cutoffs for the 3 groups, then assigning rows to the High, Medium and Low categories, but that's a guess.
The data file is available here
tmp = read.table("tmp2.txt", header=TRUE)
head(tmp)
Geneid Hsap Mmul Mmus Rnor Cfam
1 ENSG00000197711 365823.5 243429.20 44337.267 156874.50 128015.0
2 ENSG00000198712 198613.0 NA 47767.767 200176.50 210559.8
3 ENSG00000198899 189421.5 NA NA 283425.50 367112.8
4 ENSG00000198804 182559.5 NA 87301.900 277861.00 324438.0
5 ENSG00000198840 142424.5 NA 8400.457 45844.80 115027.9
6 ENSG00000171564 119147.9 93564.66 6675.290 45938.85 45140.2
Any advice is strongly appreciated, as I don't have the slightest idea how to tackle this!
Thanks,
The answer is below:
I have now replaced the file with a more realistic one (more rows)
tbl <- read.csv("http://db.tt/L2ehGh8", header=FALSE)
colnames(tbl) <- c("Geneid","Hsap","Mmul","Mmus","Rnor","Cfam")
Using cut():
I have lots of 0s and the values span a very wide range, so I transform them before binning. log() would turn the 0s into -Inf, whereas asinh() behaves like a log for large values but is defined at 0.
# tbl[, -1] drops Geneid and keeps the 5 numeric columns
tbl.data <- apply(asinh(tbl[, -1]), 2,
function(x) as.numeric(cut(x, 4)))  # index (1-4) of 4 equal-width bins per column
head(tbl.data)
Hsap Mmul Mmus Rnor Cfam
[1,] 2 2 1 1 2
[2,] 2 2 2 2 2
[3,] 1 1 1 1 1
[4,] 1 1 1 1 1
[5,] 2 3 2 2 3
[6,] 2 2 2 2 2
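To see what the asinh() plus cut() step does on its own, here is a minimal sketch on a toy vector. The values in x are made up for illustration and are not taken from the data file.

```r
# Toy values (made up), including a zero and a very large value
x <- c(0, 10, 1000, 100000)

log(x)    # the zero becomes -Inf, which cannot be binned sensibly
asinh(x)  # the zero stays 0; large values are compressed much like log(2 * x)

# Four equal-width bins on the asinh scale; as.integer() returns the bin index
bins <- as.integer(cut(asinh(x), 4))
bins
# [1] 1 1 3 4
```

Note that cut(x, 4) makes bins of equal width, not equal size, so some bins can end up nearly empty when the values are skewed.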
Another way, which was shown to me, is to use quantiles.
quantile(tbl.data[,1],0.25)
quantile(tbl.data[,1],0.5)
quantile(tbl.data[,1],0.75)
tbl.data2 <- apply(tbl.data,2,
function(x) as.numeric(as.factor(cut(x,c(-1,
quantile(x, 0.25)+0.0001,
quantile(x,0.5),
quantile(x,0.75), max(x))))))
head(tbl.data2)
Hsap Mmul Mmus Rnor Cfam
[1,] 3 3 3 2 3
[2,] 2 3 4 3 3
[3,] 2 1 1 1 2
[4,] 1 2 1 1 1
[5,] 4 4 4 4 4
[6,] 3 4 4 3 4
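The per-column codes still have to be turned into one class per row. The following is only a sketch of one way to do that with rowSums(). The toy matrix codes stands in for tbl.data2, and the mapping of codes to states (1 or NA = Low, 2-3 = Medium, 4 = High) is my own assumption, not something fixed by the question.

```r
# Toy matrix of per-column codes (1-4), standing in for tbl.data2; values are made up
codes <- rbind(c(3, 3, 3, 2, 3),
               c(4, 4, NA, 4, 1),
               c(1, NA, 1, 2, 4),
               c(1, 4, 2, NA, 4))
colnames(codes) <- c("Hsap", "Mmul", "Mmus", "Rnor", "Cfam")

# ASSUMPTION: code 1 or NA counts as Low, codes 2-3 as Medium, code 4 as High
n_low    <- rowSums(is.na(codes) | codes == 1, na.rm = TRUE)
n_medium <- rowSums(codes == 2 | codes == 3, na.rm = TRUE)
n_high   <- rowSums(codes == 4, na.rm = TRUE)

# A row gets a state only if at least 3 of the 5 columns agree;
# with 5 columns at most one state can reach 3, so the assignments cannot clash
min_cols <- 3
state <- rep(NA_character_, nrow(codes))
state[n_low    >= min_cols] <- "Low"
state[n_medium >= min_cols] <- "Medium"
state[n_high   >= min_cols] <- "High"

data.frame(codes, state)
```

Rows where no state reaches 3 columns (the last toy row here) are left as NA rather than forced into a class.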
Assuming you want NAs to be handled by not counting them rather than tossing the whole row:
Which returns the data.frame as follows: