I have the following type of data:
Person <- c("A", "B", "C", "D", "E", "E", "F", "G", "H", "I")
MOM <- c( NA, NA, NA, "A", "A", NA, "A", "B", "C", NA)
DAD <- c( NA, NA, NA, "B", "B", NA, "E", "A", "B", NA)
Xv <- 1:10
myd <- data.frame(Person, MOM, DAD, Xv, stringsAsFactors = FALSE)
myd
Person MOM DAD Xv
1 A <NA> <NA> 1
2 B <NA> <NA> 2
3 C <NA> <NA> 3
4 D A B 4
5 E A B 5
6 E <NA> <NA> 6
7 F A E 7
8 G B A 8
9 H C B 9
10 I <NA> <NA> 10
These data have a Person column plus MOM and DAD columns giving each person's parents. I would like to assign a family group to each row. NA means the information is missing. A family is defined as the set of persons who share the same MOM and DAD. Founders are those with NA for both parents; they get family = 0.
Here is what I could figure out, which is imperfect for me:
fun <- function(i) {
  i1 <- if (is.na(myd[i, 2])) i else match(myd[i, 2], myd[1:i, 2])
  i2 <- if (is.na(myd[i, 3])) i else match(myd[i, 3], myd[1:i, 3])
  min(i1, i2)
}
myd$family <- as.numeric(factor(sapply(1:nrow(myd), fun)))
Person MOM DAD Xv family
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 2
3 C <NA> <NA> 3 3
4 D A B 4 4
5 E A B 5 4
6 E <NA> <NA> 6 5
7 F A E 7 4
8 G B A 8 6
9 H C B 9 4
10 I <NA> <NA> 10 7
The above function is imperfect in the following sense:
A family does not include the rows for its parents. For example, family 4 should include the data for A and B. Thus the complete family would look like:
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 2
4 D A B 4 4
5 E A B 5 4
Another thing (at least for my purpose): DAD = A with MOM = B is the same as DAD = B with MOM = A. Families 4 and 6 are thus both products of the same parents A and B, so they should be the same family.
4 D A B 4 4
5 E A B 5 4
8 G B A 8 6
Thus expected output is:
Person MOM DAD Xv family
# founders
1 A <NA> <NA> 1 0
2 B <NA> <NA> 2 0
3 C <NA> <NA> 3 0
10 I <NA> <NA> 10 0
6 E <NA> <NA> 6 0
# Family 1
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 1
4 D A B 4 1
5 E A B 5 1
8 G B A 8 1
# Family 2
1 A <NA> <NA> 1 2
6 E <NA> <NA> 6 2
7 F A E 7 2
# Family 3
2 B <NA> <NA> 2 3
3 C <NA> <NA> 3 3
9 H C B 9 3
Edits:
It is a pity (or good!) that in human genetics we always work with similar variables: family, trio, mom (parent1, mother, female), father (dad, parent2, male), individual / subject, etc. This makes the data sets, and the issues, similar from one problem to the next.
Family vs Trio
1 Nuclear family
A x B
|
C D E
Trio -> 3 trios
A x B A x B A x B
| | |
C D E
Edits from the questioner: If you agree with the comments below that this is homework, please do not answer the question for some time (however long you think is enough for the homework submission deadline to have passed). If I get an answer I will post it later (in 3 months or so).
Edits
Founders definition: those who have both parents unknown, whether or not they have any sons / daughters, so they have NA in both the MOM and DAD columns. These are considered family 0: they are members of other families, but the founders list is not itself a real family.
Person MOM DAD Xv family
1 A <NA> <NA> 1 0
2 B <NA> <NA> 2 0
3 C <NA> <NA> 3 0
10 I <NA> <NA> 10 0
6 E <NA> <NA> 6 0
Family definition: A family consists of the parents (MOM and DAD) and all their sons and daughters. If one Person's DAD and MOM match another Person's DAD and MOM, they should be considered one family. For example, persons D and E in the following list have MOM = A and DAD = B; these two individuals together with A and B constitute a family. Now we need to recycle the data for their parents (A and B) from the founders list (family 0).
# Family 1
Person MOM DAD Xv family
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 1
4 D A B 4 1
5 E A B 5 1
Also, contrary to the human situation, here an individual can be either MOM or DAD (can switch sex), so progeny produced by A (MOM) and B (DAD) are the same as progeny produced by B (MOM) and A (DAD). Thus we need to add the following individual to the family 1 list.
Person MOM DAD Xv family
8 G B A 8 1
Thus complete list for family 1 becomes:
Person MOM DAD Xv family
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 1
4 D A B 4 1
5 E A B 5 1
8 G B A 8 1
The family 1 can be diagrammatically sketched as:
MOM x DAD MOM x DAD
A | B or B | A
----------------- ------
| | |
D E G
Here is partial solution:
myd1 <- data.frame(myd$DAD, myd$MOM)
myd$family<-as.factor(apply(myd1,1,function(x){paste(x[order(x)],collapse='-')}))
Person MOM DAD Xv family
1 A <NA> <NA> 1 NA-NA
2 B <NA> <NA> 2 NA-NA
3 C <NA> <NA> 3 NA-NA
4 D A B 4 A-B
5 E A B 5 A-B
6 E <NA> <NA> 6 NA-NA
7 F A E 7 A-E
8 G B A 8 A-B
9 H C B 9 B-C
10 I <NA> <NA> 10 NA-NA
It does not give a family number but rather a label such as A-B. NA-NA marks the founders, and because the parents are ordered before collapsing, B-A becomes A-B.
The remaining issue is that the A-B family needs the data for Persons A and B recycled (although they are in the NA-NA group).
Person MOM DAD Xv family
1 A <NA> <NA> 1 NA-NA
2 B <NA> <NA> 2 NA-NA
4 D A B 4 A-B
5 E A B 5 A-B
I’m not sure if you’ve figured this out yet, but here is one solution.
First, your data:
Second, we identify the families by merging together columns 2 and 3 from your original data. We will use this to split your data.frame into a list.
Third, we split the data.frame into a list. In this case, we end up with a list of four data.frames: one for the founders, and one for each family.
Fourth, we do some simple matching and subsetting to identify which founders belong to which families.
And, finally, we rbind this data together.
This is the output:
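The steps above can be sketched roughly as follows. This is a minimal illustration and not necessarily the code originally posted: the names key, is_founder, fam_list, founders, families and result are made up for the example, and it assumes that every parent to be recycled is itself a founder (as in your sample data) and that families are numbered in order of first appearance.

```r
# Data from the question
Person <- c("A", "B", "C", "D", "E", "E", "F", "G", "H", "I")
MOM <- c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA)
DAD <- c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA)
Xv <- 1:10
myd <- data.frame(Person, MOM, DAD, Xv, stringsAsFactors = FALSE)

# Step 2: merge columns 2 and 3 into an order-independent key.
# sort() drops NA, so founders (both parents NA) get the empty key "".
key <- apply(myd[, c("MOM", "DAD")], 1,
             function(p) paste(sort(p), collapse = "-"))
is_founder <- key == ""

# Number the families in order of first appearance; founders get 0.
myd$family <- match(key, unique(key[!is_founder]), nomatch = 0L)

# Step 3: split into a list of data.frames, one per family number.
fam_list <- split(myd, myd$family)
founders <- fam_list[["0"]]

# Step 4 and final step: for each real family, look up its parents
# among the founders, relabel them, and rbind them on top of the children.
families <- lapply(fam_list[names(fam_list) != "0"], function(kids) {
  parents <- founders[founders$Person %in% c(kids$MOM, kids$DAD), ]
  parents$family <- kids$family[1]
  rbind(parents, kids)
})

result <- c(list(`0` = founders), families)
result
```

Looking parents up only among the founders is what keeps the two rows named E apart: the founder E (row 6) is recycled into family 2, while the E with known parents (row 5) stays a child in family 1. If non-founders can also be parents in your real data, the lookup would need to be adapted.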
Update: data.frame output
If you prefer a data.frame to a list, you can do the following after completing the previous steps:
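One common way to do this is do.call(rbind, ...). The sketch below uses a small hypothetical list (fam_list, with only two columns) as a stand-in for the list built in the previous steps, so that it runs on its own; with the real list the call is the same.

```r
# Hypothetical stand-in for the list of per-family data.frames
fam_list <- list(
  `0` = data.frame(Person = c("A", "B"), family = 0L,
                   stringsAsFactors = FALSE),
  `1` = data.frame(Person = c("A", "B", "D"), family = 1L,
                   stringsAsFactors = FALSE)
)

# Stack all list elements into a single data.frame
out <- do.call(rbind, fam_list)

# rbind() builds row names from the list names; reset them if unwanted
rownames(out) <- NULL
out
```

The family column already records which group each row belongs to, so nothing is lost by dropping the list structure.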