Here is a sample:
> tmp
label value1 value2
1 aa_x_x xx xx
2 bc_x_x xx xx
3 aa_x_x xx xx
4 bc_x_x xx xx
How can I calculate the median over all repeated labels (or rather, of the corresponding values in the other data frame columns), taking into account only the first two letters of each label (i.e. "aa_1_1" and "aa_s_3" count as the same label)? The list of labels is finite and usable.
I have read about aggregate, %in%, subset and substr, but I have not been able to put together anything useful and simple.
Here is what I hope to get:
> tmp.result
label median1 some.calculation2
1 aa xx xx
2 bc xx xx
3 aa xx xx
4 bc xx xx
Thank you very much.
Have you tried making a new data frame (I'll call it tmp2) where tmp2$label == substr(tmp$label, 1, 2)? From there, you can, for example, use tapply(tmp2$value1, tmp2$label, mean) to get the average values of value1 aggregated over tmp2$label.
An option using dplyr
Or data.table
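For what it's worth, here is a minimal sketch of how the suggestions above might look in practice. The numbers are made up to stand in for the "xx" placeholders in the question, and the names key, res_base, per_row, res_dplyr and res_dt are only illustrative. Since the question does not say what some.calculation2 should be, the sketch simply takes the median of value2 as well. The dplyr and data.table parts run only if those packages are installed.

```r
# Made-up numbers stand in for the "xx" placeholders in the question.
tmp <- data.frame(
  label  = c("aa_1_1", "bc_2_1", "aa_s_3", "bc_x_x"),
  value1 = c(1, 10, 3, 30),
  value2 = c(2, 20, 4, 40),
  stringsAsFactors = FALSE
)

# Grouping key: the first two characters of each label.
tmp$key <- substr(tmp$label, 1, 2)

# Base R: one row per key, median of each value column.
res_base <- aggregate(cbind(value1, value2) ~ key, data = tmp, FUN = median)

# Base R: one result per original row (matches the four-row output
# shown in the question), using ave().
per_row <- ave(tmp$value1, tmp$key, FUN = median)

# dplyr variant (assumes the package is installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  res_dplyr <- dplyr::summarise(
    dplyr::group_by(tmp, key),
    median1 = median(value1),
    median2 = median(value2)
  )
}

# data.table variant (assumes the package is installed).
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(tmp)
  res_dt <- dt[, list(median1 = median(value1),
                      median2 = median(value2)), by = key]
}
```

With this toy data, res_base has one row for "aa" and one for "bc", while per_row repeats each group's median for every row of tmp. Swapping median for mean gives the averages from the tapply suggestion.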