I have got a dataframe made up by three columns (see example in the


Asked: June 14, 2026 (2026-06-14T21:38:25+00:00)


I have a data frame with three columns (see the example in the code). The first column contains categories (a), the second the number of observations (b), and the third the average value of those observations (c).

    #create a test df
    a<-factor(c("aaa","aaa","aaa","ddd","eee","ddd","aaa","ddd"))
    b<-c(3,4,1,3,5,7,3,2)
    c<-c(1,2,NA,4,5,6,7,NA)
    df.abc<-data.frame(a=a,b=b,c=c)
    df.abc

If the number of observations was 1 or 2, the entries were marked as missing values (NA).

So the aim of my function is to replace these missing values with the mean value of each category.

It took me a while, but I got a function working that replaces the missing values for one category (for the case where the number of observations was 1). It looks like this:

    #function to substitute the missing values in column c by their means
    #according to their categories
    function.abc<-function(x){
        ifelse(
            (df.abc[,1]==x)&(df.abc[,2]==1),
            mean(df.abc$c[df.abc$a ==x],na.rm=TRUE),
            df.abc[,3]
        )
    }

Testing this function:

    #test the function for the category "aaa"
    function.abc("aaa")

It works quite well (but it gives only the plain mean rather than the weighted mean). The output is:

    [1] 1.000000 2.000000 3.333333 4.000000 5.000000 6.000000 7.000000 NA

Now my problem is that I have quite a lot of categories (n=32), and I tried to apply this function over a vector containing my categories. A simple example would be:

    #test the function for a testvector
    test.vector<-c("aaa","ddd")
    function.abc(test.vector)

The output is:

    [1] 1.0 2.0 4.5 4.0 5.0 6.0 7.0 NA

So obviously this won't work…

Can anybody help me rearrange the function? I'm quite new to programming, and it is still a big challenge for me to design short, well-working functions…

Edit:

I would like the output to be:

    [1] 1.000000 2.000000 3.200000 4.000000 5.000000 6.000000 7.000000 5.400000

so that the weighted average of group aaa (3.2) replaces the NA in aaa and the weighted average of group ddd (5.4) replaces the NA in ddd…

1 Answer

Editorial Team · Answer 1 · 2026-06-14T21:38:26+00:00

To work with several columns at once within each category, you need something that splits the data frame and then operates on the pieces. The `lapply(split(df, fac), function(x) {...})` paradigm works well for this. Alternatively, you can use `transform` or the plyr package.

    > lapply(split(df.abc, df.abc$a),
    +        function(dfrm) {
    +            dfrm[is.na(dfrm$c), "c"] <-
    +                weighted.mean(dfrm[!is.na(dfrm$c), "c"], dfrm[!is.na(dfrm$c), "b"])
    +            dfrm  # need to evaluate dfrm in order to return the full value
    +        })
    $aaa
        a b   c
    1 aaa 3 1.0
    2 aaa 4 2.0
    3 aaa 1 3.2
    7 aaa 3 7.0

    $ddd
        a b   c
    4 ddd 3 4.0
    6 ddd 7 6.0
    8 ddd 2 5.4

    $eee
        a b c
    5 eee 5 5

You can then `rbind` them using `do.call`:

    > do.call(rbind, lapply(split(df.abc, df.abc$a),
    +         function(dfrm) {
    +             dfrm[is.na(dfrm$c), "c"] <-
    +                 weighted.mean(dfrm[!is.na(dfrm$c), "c"], dfrm[!is.na(dfrm$c), "b"])
    +             dfrm
    +         }))
            a b   c
    aaa.1 aaa 3 1.0
    aaa.2 aaa 4 2.0
    aaa.3 aaa 1 3.2
    aaa.7 aaa 3 7.0
    ddd.4 ddd 3 4.0
    ddd.6 ddd 7 6.0
    ddd.8 ddd 2 5.4
    eee   eee 5 5.0
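
Note that `rbind` returns the rows grouped by category, not in their original order. If the original row order matters, one possible variation is to put the pieces back with `unsplit()` instead. The following is only a sketch under a few assumptions: `fill_weighted` is a hypothetical helper name (not from any package), and the guard that skips groups with no NA, or with nothing but NA, is an addition that the code above does not have.

    # Rebuild the example data from the question
    df.abc <- data.frame(
        a = factor(c("aaa", "aaa", "aaa", "ddd", "eee", "ddd", "aaa", "ddd")),
        b = c(3, 4, 1, 3, 5, 7, 3, 2),
        c = c(1, 2, NA, 4, 5, 6, 7, NA)
    )

    # fill_weighted() is a hypothetical helper: within one category, replace
    # the NAs in c with the mean of the non-missing c values, weighted by b
    fill_weighted <- function(dfrm) {
        ok <- !is.na(dfrm$c)
        # assumption: leave the group untouched if there is nothing to fill
        # or nothing to average
        if (any(!ok) && any(ok)) {
            dfrm$c[!ok] <- weighted.mean(dfrm$c[ok], dfrm$b[ok])
        }
        dfrm
    }

    # split() by category, fill each piece, then unsplit() restores row order
    filled <- unsplit(lapply(split(df.abc, df.abc$a), fill_weighted), df.abc$a)
    filled$c

Here `filled$c` is a plain numeric vector aligned with the rows of `df.abc`, which is the shape of result the question asks for.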
