I have a huge dataset in which there is one column including several values

Asked: 2026-06-18T16:30:43+00:00

I have a huge dataset in which there is one column including several values for each subject (row). Here is a simplified sample dataframe:

data <- data.frame(subject = c(1:8), sex = c(1, 2, 2, 1, 2, 1, 1, 2), 
              age = c(35, 29, 31, 46, 64, 57, 49, 58), 
              v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C"))

> data
  subject sex age          v1
1       1   1  35           2
2       2   2  29           0
3       3   2  31         3,5  # separated by a comma
4       4   1  46         2 1  # separated by a blank space
5       5   2  64         A,4
6       6   1  57       B,1,C
7       7   1  49   A and B,3
8       8   2  58 5, 6 A or C

I first want to remove the letters (A, B, A and B, …) in the fourth column (v1), and then split the fourth column into multiple columns just like this:

  subject sex age x1 x2 x3 x4 x5 x6
1       1   1  35  0  1  0  0  0  0        
2       2   2  29  0  0  0  0  0  0
3       3   2  31  0  0  1  0  1  0  
4       4   1  46  1  1  0  0  0  0
5       5   2  64  0  0  0  1  0  0
6       6   1  57  1  0  0  0  0  0
7       7   1  49  0  0  1  0  0  0
8       8   2  58  0  0  0  0  1  1

where the 1st subject takes 1 at x2 because it takes 2 at v1 in the original dataset, the 3rd subject takes 1 at both x3 and x5 because it takes 3 and 5 at v1 in the original dataset, and so on.

I would appreciate any help on this question. Thanks a lot.

1 Answer

Editorial Team · Answer 1 · 2026-06-18T16:30:44+00:00

One solution:

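# Split each entry on runs of non-digits; empty pieces become NA
r <- lapply(strsplit(as.character(data$v1), "[^0-9]+"), as.numeric)
# Build one 0/1 row per subject, with a 1 at each position listed in v1
m <- as.data.frame(t(sapply(r, function(x) {
        y <- rep(0, 6)
        y[x[!is.na(x)]] <- 1
        y
     })))
data <- cbind(data[, c("subject", "sex", "age")], m)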

#   subject sex age V1 V2 V3 V4 V5 V6
# 1       1   1  35  0  1  0  0  0  0
# 2       2   2  29  0  0  0  0  0  0
# 3       3   2  31  0  0  1  0  1  0
# 4       4   1  46  1  1  0  0  0  0
# 5       5   2  64  0  0  0  1  0  0
# 6       6   1  57  1  0  0  0  0  0
# 7       7   1  49  0  0  1  0  0  0
# 8       8   2  58  0  0  0  0  1  1
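
To see why the !is.na(x) filter is needed, it may help to look at the split step on its own. The sketch below uses a few of the sample values; the names v1, parts and nums are illustrative only and not part of the answer above.

```r
# Illustrative values taken from the sample data
v1 <- c("3,5", "A,4", "5, 6 A or C", "0")

# Split on every run of non-digit characters
parts <- strsplit(v1, "[^0-9]+")
parts[[2]]  # a leading letter leaves an empty string before the "4"

# Empty strings convert to NA, so they are dropped before indexing
nums <- lapply(parts, function(p) {
  x <- as.numeric(p)
  x[!is.na(x)]
})
nums
```

A value of 0 survives this step, but assigning to index 0 of a vector in R changes nothing, which is why subject 2 ends up with all zeros.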

Following DWin’s awesome solution, m could instead be built with %in%:

m <- as.data.frame(t(sapply(r, function(x) {
        0 + 1:6 %in% x[!is.na(x)]
     })))
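
If the columns should be named x1 to x6 as in the question, the same idea can be wrapped in a small helper. This is only a sketch: expand_codes is a hypothetical name, and it swaps the strsplit step for regmatches/gregexpr, which extracts the digit runs directly instead of splitting on the non-digits. The number of columns n is assumed to be known in advance (6 here).

```r
# Hypothetical helper, not from the original answer:
# expand a column of free-text codes into 0/1 indicator columns
expand_codes <- function(v, n, prefix = "x") {
  v <- as.character(v)  # guards against v1 being stored as a factor
  # Extract every run of digits; letters, commas and spaces are ignored
  codes <- regmatches(v, gregexpr("[0-9]+", v))
  # One integer 0/1 row per entry: 1 where the position appears among the codes
  rows <- lapply(codes, function(x) as.integer(seq_len(n) %in% as.integer(x)))
  m <- do.call(rbind, rows)
  colnames(m) <- paste0(prefix, seq_len(n))
  as.data.frame(m)
}

# Usage with the sample data from the question
data <- data.frame(subject = 1:8, sex = c(1, 2, 2, 1, 2, 1, 1, 2),
                   age = c(35, 29, 31, 46, 64, 57, 49, 58),
                   v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C",
                          "A and B,3", "5, 6 A or C"),
                   stringsAsFactors = FALSE)
result <- cbind(data[, c("subject", "sex", "age")],
                expand_codes(data$v1, n = 6))
```

Codes outside 1 to n, such as the 0 for subject 2, simply produce no 1 in any column.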
