I would like to ask you for efficiency suggestions for a specific coding problem

Asked: June 13, 2026

I would like to ask you for efficiency suggestions for a specific coding problem in R. I have a string vector in the following style:

[1] "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES"
[2] "DISTANCE=2179"                                              
[3] "HGVSc=ENST00000466430.1:n.911C>T;EXON=4/4;CANONICAL=YES"    
[4] "DISTANCE=27;CANONICAL=YES;common"

In each element of the vector, the individual entries are separated by a ; and most of them have the format KEY=VALUE. However, some entries consist of a KEY only (see “common” in [4]). In this example there are 15 different keys, and not every key appears in each element of the vector. The 15 keys are:

names <- c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common')

From this vector I would like to create a dataframe that looks like this:

  ENSP HGVS DOMAINS EXON INTRON HGVSp                        HGVSc CANONICAL
1    -    -       -    -    1/1     - ENST00000495576.1:n.820-1G>A       YES
2    -    -       -    -      -     -                            -         -
3    -    -       -  4/4      -     -   ENST00000466430.1:n.911C>T       YES
4    -    -       -    -      -     -                            -       YES
  GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
1    -        -    -    -    -        -      -
2    -     2179    -    -    -        -      -
3    -        -    -    -    -        -      -
4    -       27    -    -    -        -    YES

I wrote this function to solve the problem:

unlist.info <- function(names, column){
  info.mat <- matrix(rep('-', length(column)*length(names)), nrow=length(column), ncol=length(names), dimnames=list(c(), names))
  info.mat <- as.data.frame(info.mat, stringsAsFactors=FALSE)

  for (i in seq_along(column)){
    info <- unlist(strsplit(column[i], "\\;"))
    for (e in info){
      e <- unlist(strsplit(e, "\\="))
      j <- which(names == e[1])
      if (length(e) > 1){
        # KEY=VALUE. The value might contain a = as well
        value <- paste(e[2:length(e)], collapse='=')
        info.mat[i,j] <- value
      }else{
        # only KEY
        info.mat[i,j] <- 'YES'
      }
    }
  }
  return(info.mat)
}

And then I call:

mat <- unlist.info(names, vector)

Even though this works, it is really slow, and I am handling vectors with over 100,000 entries. I realize that looping is inelegant and inefficient in R, and I am familiar with the concept of applying functions to data frames. However, since every element of the vector contains a different subset of KEY=VALUE or KEY entries, I could not come up with a more efficient function.

1 Answer

Editorial Team · Answer 1 · 2026-06-13T00:23:34+00:00

Here you go:

Recreate the data:

x <- c(
  "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES",
  "DISTANCE=2179",
  "HGVSc=ENST00000466430.1:n.911C>T;EXON=4/4;CANONICAL=YES",
  "DISTANCE=27;CANONICAL=YES;common"
)

Create a named vector with your desired names. This is used for fast lookup later:

names <- setNames(1:15, c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common'))
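
To illustrate why the named vector helps: indexing it with a character vector of keys returns all the matching column positions in one vectorised step, so no `which()` call per entry is needed. A minimal sketch, where `keys`, `key_index`, `pos` and `missing_pos` are hypothetical names chosen here to avoid clashing with `names`:

```r
# Hypothetical names for illustration only; the answer itself uses `names`.
keys <- c('ENSP','HGVS','DOMAINS','EXON','INTRON','HGVSp','HGVSc','CANONICAL',
          'GMAF','DISTANCE','HGNC','CCDS','SIFT','PolyPhen','common')
key_index <- setNames(seq_along(keys), keys)

# Look up several keys at once; the result holds their column positions.
pos <- key_index[c("HGVSc", "INTRON", "CANONICAL")]
print(unname(pos))  # 7 5 8

# A key that is not in the list yields NA rather than an error.
missing_pos <- key_index["NOT_A_KEY"]
print(is.na(missing_pos))
```

Because an unknown key comes back as `NA`, input that may contain keys outside the list of 15 probably needs filtering before the positions are used for assignment.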

Create a helper function that assigns each variable to the correct position in a matrix. Then use lapply and strsplit:

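assign <- function(x, names){
  # x is a list of split entries: c(KEY, VALUE) or just KEY
  xx <- sapply(x, function(i){
    # re-join a value that was split on an inner "="; a bare KEY becomes "YES"
    c(i[1], if(length(i) > 1L) paste(i[-1], collapse="=") else "YES")
  })
  z <- rep(NA_character_, length(names))
  # look up each key's column in the named vector and fill in its value
  z[names[xx[1, ]]] <- xx[2, ]
  z
}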

sx <- lapply(strsplit(x, ";"), strsplit, "=")
ret <- t(sapply(sx, assign, names))
colnames(ret) <- names(names)
ret

The results:

     ENSP HGVS DOMAINS EXON  INTRON HGVSp HGVSc                          CANONICAL GMAF DISTANCE HGNC
[1,] NA   NA   NA      NA    "1/1"  NA    "ENST00000495576.1:n.820-1G>A" "YES"     NA   NA       NA
[2,] NA   NA   NA      NA    NA     NA    NA                             NA        NA   "2179"   NA
[3,] NA   NA   NA      "4/4" NA     NA    "ENST00000466430.1:n.911C>T"   "YES"     NA   NA       NA
[4,] NA   NA   NA      NA    NA     NA    NA                             "YES"     NA   "27"     NA
     CCDS SIFT PolyPhen common
[1,] NA   NA   NA       NA    
[2,] NA   NA   NA       NA    
[3,] NA   NA   NA       NA    
[4,] NA   NA   NA       "YES"
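
For very long vectors, a variant that avoids calling a helper once per element may be worth trying. The sketch below is an alternative technique, not the approach above: it flattens all entries into one vector, splits each entry at its first `=` only (so a value that contains or ends in `=` is kept whole), and fills the matrix in one step through a two-column (row, column) index. The names `keys`, `parse_info`, `res` and `df` are hypothetical, and dropping entries whose key is not in `keys` is an assumption about the desired behaviour.

```r
# Hypothetical names throughout; a sketch, not a drop-in replacement.
keys <- c('ENSP','HGVS','DOMAINS','EXON','INTRON','HGVSp','HGVSc','CANONICAL',
          'GMAF','DISTANCE','HGNC','CCDS','SIFT','PolyPhen','common')

parse_info <- function(x, keys) {
  parts <- strsplit(x, ";", fixed = TRUE)
  # index of the input element that each entry came from
  row_idx <- rep(seq_along(parts), lengths(parts))
  entry <- unlist(parts)
  # position of the first "=" in each entry (-1 when there is none)
  eq <- as.integer(regexpr("=", entry, fixed = TRUE))
  has_value <- eq > 0L
  key <- ifelse(has_value, substr(entry, 1L, eq - 1L), entry)
  value <- ifelse(has_value, substr(entry, eq + 1L, nchar(entry)), "YES")
  col_idx <- match(key, keys)
  known <- !is.na(col_idx)  # assumption: entries with unlisted keys are dropped
  out <- matrix(NA_character_, nrow = length(x), ncol = length(keys),
                dimnames = list(NULL, keys))
  # fill all cells at once through a two-column (row, column) index matrix
  out[cbind(row_idx[known], col_idx[known])] <- value[known]
  out
}

x <- c(
  "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES",
  "DISTANCE=2179",
  "HGVSc=ENST00000466430.1:n.911C>T;EXON=4/4;CANONICAL=YES",
  "DISTANCE=27;CANONICAL=YES;common"
)
res <- parse_info(x, keys)

# Data frame with "-" placeholders, as requested in the question
res[is.na(res)] <- "-"
df <- as.data.frame(res, stringsAsFactors = FALSE)
```

Whether this is actually faster on 100,000 entries depends on the data, so it is worth timing both versions with `system.time()` on a realistic sample before committing to one.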
