I know this question has been asked here and here, but I ran into a small problem when I tried it out:
x<- str_extract("Hello peopllz! My new home is #crazy gr8! #wow", "#\S+")
Error: '\S' is an unrecognized escape in character string starting "#\S"
I changed the regex to "#(.+) ?" and then to "#\\s", but neither extracted the hashtags.
I then tried the gsub way:
x<- gsub("[^#(.+) ?]","","Hello! #London is gr8. #Wow")
It gave: " # . #"
Any ideas where I am going wrong? I’d like my output as a vector/list of all the hashtags in the tweet (without the hashes!).
Edit: I would prefer not tokenizing the tweet, because:
1. I am not tokenizing the tweets for the rest of my program,
2. It would become a very expensive step were I to scale it to handle large volumes of tweets.
Use "#\\S+" instead of "#\S+".
There are two levels of parsing going on here. Before the low-level regexp function within str_extract gets the pattern you want to search for (i.e. "#\S+"), the string is first parsed by R. R does not recognize \S as a valid escape sequence and throws an error. By escaping the backslash as \\, you tell R to pass \ and S to the regexp function as two ordinary characters, instead of interpreting them as one escape sequence.
Side track
This can produce rather bizarre expressions. Imagine that you have a list of addresses of computers on a Windows network, of the form "\\computer". To search for them you would need to type str_extract(adr, "\\\\\\w+"), which R turns into "\\\w+" internally before handing it to the regexp function. The regexp function then reads \\ as one literal backslash and \w+ as one or more word characters.