I’ve done some web-scraping and have a character string, my_html, containing links that I would like to extract.
I want to use grep() for this and tried the following (this is an example of my_html, which is much longer):
my_html <- 'stuff more stuff ...
<TD ><A CLASS=my_link HREF=\"https://www.stuff.com/secure-bin/my_club/myrep.cgi/tpw9109.cry?scrtpw9109.cry\">
other stuff
<p> www.google.com </p>
end'
my_pattern <- "<TD><A CLASS=my_link HREF=*>"
grep(my_pattern,x=my_html,value=TRUE)
which gets me
character(0)
I think the problem is to do with the special characters in the pattern, but I don’t know the remedy.
Basically, the pattern throws away anything before the
HREF=\", using 2 backslashes to represent a single backslash and \" to represent a double quote. It then captures everything before the next double quote as the second matched section, and everything from that mark to the end as the third section. So it should return only the middle matching section (if one exists).
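The three-section idea described above can be sketched roughly as follows. This is only an illustration under some assumptions, not necessarily the exact pattern intended: it uses sub() with perl = TRUE and the (?s) flag so that "." also matches the newlines in my_html, and the name my_url and the grepl() guard are mine. Because the pattern is written in single quotes here, the double quote needs no backslash.

```r
my_html <- 'stuff more stuff ...
<TD ><A CLASS=my_link HREF=\"https://www.stuff.com/secure-bin/my_club/myrep.cgi/tpw9109.cry?scrtpw9109.cry\">
other stuff
<p> www.google.com </p>
end'

# Three capture groups:
#   (1) everything up to and including HREF="
#   (2) everything before the next double quote (the link itself)
#   (3) everything from that double quote to the end of the string
# (?s) makes "." match newlines; it requires perl = TRUE.
my_pattern <- '(?s)^(.*HREF=")([^"]*)(.*)$'

# sub() returns its input unchanged when nothing matches,
# so check for a match first (my_url is a hypothetical name).
if (grepl(my_pattern, my_html, perl = TRUE)) {
  my_url <- sub(my_pattern, "\\2", my_html, perl = TRUE)  # keep only group 2
} else {
  my_url <- NA_character_
}

print(my_url)
```

As for the original attempt: in a regular expression, =* means "zero or more = characters" rather than "anything after the =", and the pattern has <TD> where the string has <TD >, so it cannot match. Also note that grep(..., value=TRUE) returns whole elements of the input vector, not just the matched part, which is why sub() is used here instead.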