I am trying to write a function which removes websites from a piece of text. I have:
removeWebsites<- function(text){
text = gsub("(http://|https://|www.)[[:alnum:]~!#$%&+-=?,:/;._]*",'',text)
return(text)
}
This handles a large set of the problem, but not a popular one, i.e something of the form xyz.com
I do not wish to add .com at the end of the above regex, as it limits the scope of that regex. However I tried writing some more regexex like:
gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])
This worked, but it also modified email ids of the form abc@xyz.com to abc@. I don’t want this, so I modified it to
gsub("*((^@)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
This left the email ids alone but stopped recognising websites of the form xyz.com
I understand that I need some sort of a set difference here, of the form of what was explained here but I was not able to implement it (mainly because I was not able to completely understand it). Any idea on how I go about solving my problem?
Edit: I tried negative lookaheads:
gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!@)[^(?!.*@)]*.com",'',testset[10])
I got a ‘invalid regex’ error. I believe a little help in correcting may get this to work…
I can’t believe it. There actually is a simple solution to it.
This works by:
Please do look into breaking it! I’m sure there will be cases that will break this as well.