Given the following example:
sites=c('site 1','site 2')
link=c('<a href="http://example.com/path">This website</a>', '<a href="http://example.com/path2">That website</a>')
w=data.frame(link,sites)
w
link                                                 sites
<a href="http://example.com/path">This website</a>   site 1
<a href="http://example.com/path2">That website</a>  site 2
How do I apply a regular expression that parses the HTML snippet, extracts the URL and the link text, and puts them into separate columns of a data frame? For the example above, what do I need to do to generate a data frame that looks like this:
url                       name          sites
http://example.com/path   This website  site 1
http://example.com/path2  That website  site 2
Here is a solution using the htmlTreeParse function from the XML package (i.e., without regular expressions).
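A minimal sketch of that approach, assuming the XML package is installed. The helper name parse_link is made up for illustration, and this is one way such a solution might look rather than the only one: each snippet is parsed on its own, the href attribute and the text of the first <a> node are read out, and the pieces are bound back together with the sites column.

```r
library(XML)  # assumption: the XML package is installed

sites <- c('site 1', 'site 2')
link <- c('<a href="http://example.com/path">This website</a>',
          '<a href="http://example.com/path2">That website</a>')
w <- data.frame(link, sites, stringsAsFactors = FALSE)

# Hypothetical helper: parse one HTML snippet and return the href
# attribute and the link text of its first <a> element.
parse_link <- function(html) {
  doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)
  a <- getNodeSet(doc, "//a")[[1]]
  data.frame(url = xmlGetAttr(a, "href"),
             name = xmlValue(a),
             stringsAsFactors = FALSE)
}

# Apply the helper to every row and stack the one-row results.
result <- do.call(rbind, lapply(w$link, parse_link))
result$sites <- w$sites
result
```

Parsing the snippets with an HTML parser instead of a pattern avoids having to anticipate attribute order, quoting style, or extra whitespace inside the tag. The sketch assumes every snippet contains at least one <a> element; a row without one would make the [[1]] lookup fail.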