I have 5 million sequences (probes, to be specific) like the ones below, and I need to extract the name from each string.
The names here are 1007_s_at:123:381, 1007_s_at:128:385, and so on.
I am using lapply, but it takes too much time, and I have several other similar files. Can you suggest a faster way to do this?
nm = c(
"probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
"probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
"probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
"probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;" ,
"probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
"probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")
extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed=TRUE)[[1]][1], ignore.case=TRUE)
pr = lapply(nm, extractProbe)
Desired output:
1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391
Using regular expressions:
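A minimal sketch of such a call, assuming a pattern with a literal probe: prefix, two lazy capture groups, and the rest of the line up to $. The exact pattern, the leading ^ anchor, and perl = TRUE are assumptions made for illustration. Because sub() is vectorized, it can be applied to the whole character vector at once, with no lapply.

```r
# Two sample strings in the same layout as the question's data.
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)

# Assumed pattern: group 1 is the array name (e.g. HG-Focus),
# group 2 is the id, i.e. everything up to the first semicolon.
# The replacement "\\2" keeps only the id.
ids <- sub("^probe:(.*?):(.*?);.*$", "\\2", nm, perl = TRUE)
print(ids)
```

The result is a plain character vector rather than a list, so it can be used directly as names or as a data frame column.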
A bit of explanation:
- . means “any character”.
- .* means “any number of characters”.
- .*? means “any number of characters, but do not be greedy”.
- Whatever is matched inside parentheses is captured and can be referred to in the replacement as \\1, \\2, etc.
- $ means end of the line (or string).

So here, the pattern matches the whole line and captures two things via the two (.*?): the HG-Focus (or other) thing you don’t want as \\1, and your id as \\2. By setting the replacement to \\2, we are effectively replacing the whole string with your id.

I now realize it was not necessary to capture the first thing, so this would work just as well:
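A sketch of that simpler variant, under the same assumptions as before (the exact pattern, the ^ anchor, and perl = TRUE are illustrative choices). Only the id is captured, so the replacement refers to \\1.

```r
# Two sample strings in the same layout as the question's data.
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)

# The array name is matched lazily but not captured;
# the single capture group holds the id up to the first semicolon.
ids_single <- sub("^probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)
print(ids_single)
```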