I am looking for a way to delete the characters at certain positions within a string in R. For example, given the string "1,2,1,1,2,1,1,1,1,2,1,1", I want to delete the characters at the 3rd, 4th, 7th and 8th positions. The result would be the string "1,1,2,1,1,1,1,2,1,1".
Unfortunately, breaking the string into a list using strsplit is not an option, because the strings I am working with are over 1 million characters long. Considering I have about 2,500 strings, it works out to be quite some time.
Alternatively, finding a way to replace the characters with an empty string "" would achieve the same purpose – I think. Looking into this line of thought, I came across this StackOverflow post:
R: How can I replace let's say the 5th element within a string?
Unfortunately, the solution suggested is hard to efficiently generalize and the following takes about 60 seconds per input string for a list of 2000 positions to remove:
subchar2 = function(inputstring, pos){
  string = ""
  memory = 0
  for(num in pos){
    string = paste(string, substr(inputstring, (memory+1), (num-1)), sep = "")
    memory = num
  }
  string = paste(string, substr(inputstring, (memory+1), nchar(inputstring)), sep = "")
  return(string)
}
Looking into the problem, I found a snippet of code that seems to replace the characters at certain positions with "-":
subchar <- function(string, pos) {
  for(i in pos) {
    string <- gsub(paste("^(.{", i-1, "}).", sep=""), "\\1-", string)
  }
  return(string)
}
I don’t quite understand regular expressions (yet), but I have a strong suspicion something along these lines will be much better time-wise than the first code solution. Unfortunately, this subchar function seems to break when the values in pos get high:
> test = subchar(data[1], 257)
Error in gsub(paste("^(.{", i - 1, "}).", sep = ""), "\\1-", string) :
invalid regular expression '^(.{256}).', reason 'Invalid contents of {}'
I was also considering reading the string data into a table using SQL, but I was hoping that there would be an elegant string solution. The SQL implementation in R to do this seems rather complicated.
Any ideas?
Thanks!
strsplit is more than ten times faster if you use fixed = TRUE. By rough extrapolation, it will take a little over 2 minutes to process your 2,500 strings of 1,000,000 comma-separated integers. This is almost 3 times faster than using scan:
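As a rough illustration of the strsplit approach (this is not the original benchmark), the sketch below splits once with fixed = TRUE, drops the unwanted elements with a negative index, and pastes the rest back together. The helper name delete_fields and its sep argument are hypothetical, and it assumes the positions to drop are valid 1-based indices. With the default sep = "," the indices refer to whole comma-separated fields; with sep = "" the string is split into single characters, so the indices are character positions as in the question.

```r
# Hypothetical helper: remove the elements at the 1-based indices in `idx`
# after splitting `x` on the literal separator `sep`.
delete_fields <- function(x, idx, sep = ",") {
  # fixed = TRUE treats `sep` as a literal string and skips the regex engine
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]
  # Guard: parts[-integer(0)] would return an empty vector, not all of `parts`
  if (length(idx) > 0) parts <- parts[-idx]
  paste(parts, collapse = sep)
}

x <- "1,2,1,1,2,1,1,1,1,2,1,1"

# Drop the 2nd and 4th comma-separated fields
delete_fields(x, c(2, 4))
# [1] "1,1,2,1,1,1,1,2,1,1"

# Same result by character position: drop characters 3, 4, 7 and 8
delete_fields(x, c(3, 4, 7, 8), sep = "")
# [1] "1,1,2,1,1,1,1,2,1,1"
```

Splitting by field keeps the vector about half as long as splitting by character for single-digit values, which may matter for strings of a million characters; actual timings would need to be measured on the real data.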