I have a data frame made up of three columns (see the example in the code). The first column contains categories (a), the second the number of observations (b), and the third the average value of these observations (c).
#create a test df
a<-factor(c("aaa","aaa","aaa","ddd","eee","ddd","aaa","ddd"))
b<-c(3,4,1,3,5,7,3,2)
c<-c(1,2,NA,4,5,6,7,NA)
df.abc<-data.frame(a=a,b=b,c=c)
df.abc
If the number of observations was 1 or 2, the entries were marked as missing values (NA).
So the aim of my function is to substitute these missing values with the mean value of each category.
It took me a while, but I got a function working that substitutes all the missing values for one category (in the case that the number of observations was 1). It looks like this:
#function to substitute the missing values in column c by their means
#according to their categories
function.abc<-function(x){
ifelse(
(df.abc[,1]==x)&(df.abc[,2]==1),
mean(df.abc$c[df.abc$a ==x],na.rm=TRUE),
df.abc[,3]
)
}
Testing this function:
#test the function for the category "aaa"
function.abc("aaa")
It works quite well (but it uses only the plain mean rather than the mean weighted by the number of observations). The output is:
[1] 1.000000 2.000000 3.333333 4.000000 5.000000 6.000000 7.000000 NA
Now my problem is that I have quite a lot of categories (n=32), and I tried to apply this function over a vector containing my categories. A simple example in this case would be:
#test the function for a testvector
test.vector<-c("aaa","ddd")
function.abc(test.vector)
the output is:
[1] 1.0 2.0 4.5 4.0 5.0 6.0 7.0 NA
So obviously this won’t work out…
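A likely reason for the 4.5, shown as a minimal sketch: `==` compares element by element, so the length-2 vector is recycled along column a instead of being matched as a set. The names `vals`, `x` and `hits` below are only illustrative.

```r
# Columns a and c from the test data above
a <- factor(c("aaa","aaa","aaa","ddd","eee","ddd","aaa","ddd"))
vals <- c(1, 2, NA, 4, 5, 6, 7, NA)

x <- c("aaa", "ddd")

# x is recycled to aaa, ddd, aaa, ddd, ... and compared position by position
hits <- a == x
hits
# [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

# The mean is then taken over a mix of "aaa" and "ddd" rows
mean(vals[hits], na.rm = TRUE)
# [1] 4.5
```

So the single value 4.5 is the mean of 1, 4, 6 and 7, not a per-category mean.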
Can anybody help me to rearrange the function? I’m quite new to programming and it is still a big challenge for me to design short, well-working functions…
Edit:
I would like the output to be:
[1] 1.000000 2.000000 3.20000 4.000000 5.000000 6.000000 7.000000 5.000000
so that the average of group aaa (3.20000) substitutes the NA value in aaa and the average of group ddd (5.0000000) substitutes the NA in ddd…
To work with multiple columns at once within a category, you need something that splits the data frame and then works on the pieces. The `lapply(split(df, fac), function(x) {...})` paradigm works well for this. Alternatively, you can use `transform` or the `plyr` package. You can then `rbind` the pieces back together using `do.call`:
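A minimal sketch of that split/lapply/rbind pattern on the question's test data follows. The helper name `fill.group` and its `weighted` argument are hypothetical, not something from the question. It also assumes every NA in column c should be filled, whatever the value of b.

```r
# Test data from the question
df.abc <- data.frame(
  a = factor(c("aaa","aaa","aaa","ddd","eee","ddd","aaa","ddd")),
  b = c(3, 4, 1, 3, 5, 7, 3, 2),
  c = c(1, 2, NA, 4, 5, 6, 7, NA)
)

# Hypothetical helper: fill the NAs in column c of ONE category.
# weighted = TRUE weights each average by its number of observations (b).
fill.group <- function(g, weighted = FALSE) {
  ok <- !is.na(g$c)
  if (any(ok)) {
    m <- if (weighted) weighted.mean(g$c[ok], g$b[ok]) else mean(g$c[ok])
    g$c[!ok] <- m
  }
  g
}

# Split by category, fix each piece, then bind the pieces together
pieces <- lapply(split(df.abc, df.abc$a), fill.group)
by.group <- do.call(rbind, pieces)   # rows come out grouped by category

# unsplit() puts the rows back in their original order instead
filled <- unsplit(pieces, df.abc$a)
filled$c
# [1] 1.000000 2.000000 3.333333 4.000000 5.000000 6.000000 7.000000 5.000000

# Same idea with means weighted by b
filled.w <- unsplit(lapply(split(df.abc, df.abc$a), fill.group, weighted = TRUE),
                    df.abc$a)
filled.w$c
# [1] 1.0 2.0 3.2 4.0 5.0 6.0 7.0 5.4
```

Note that the output requested in the edit mixes the two variants. The 3.2 for aaa is the weighted mean, while the 5.0 for ddd is the plain mean (the weighted mean for ddd is 5.4). Pick whichever definition you actually want.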