I’m trying to get the values of ‘Date Posted’ and ‘Date Updated’ as pictured here. The website URL is: http://sulit.com.ph/3991016
I have a feeling I should be using xpathSApply, as suggested in this thread Web Scraping (in R?), but I just can’t get it to work.
library(XML)  # provides htmlTreeParse, xpathSApply and xmlValue
url = "http://sulit.com.ph/3991016"
doc = htmlTreeParse(url, useInternalNodes = TRUE)
date_posted = xpathSApply(doc, "??????????", xmlValue)
Also, does anyone know a quick way to get the phrase ‘P27M’, which is also listed on the website? Help would be appreciated.
Here’s another way to do it.
There’s no need to use RCurl, as htmlParse will parse URLs directly. getNodeSet returns a list of the nodes that have “Date Posted” or “Date Updated” as their value. The lapply loops over both of those nodes, first finding the parent node and then taking the value of its second “span” node. This part may not be very robust if the website changes its formatting between pages (which, after looking at the HTML for that site, seems very possible). SlowLearner’s gsub cleans up both dates. I added strptime to return the dates as a date-time class (POSIXlt) rather than as strings, but that step is optional and depends on how you plan to use the info in the future. HTH
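The steps described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the HTML snippet, its structure, and the two dates in it are made up so the example runs without network access, and the real page’s markup may well differ. For the live page you would call htmlParse(url) instead, and the XPath and date format would likely need adjusting.

```r
library(XML)

# Hypothetical markup standing in for the listing page (an assumption,
# not the site's actual HTML); the dates are invented for illustration.
html <- '
<html><body>
  <ul>
    <li><span>Date Posted</span> <span>  March 5, 2012 </span></li>
    <li><span>Date Updated</span> <span>  April 17, 2012 </span></li>
  </ul>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Find the span nodes whose text is exactly one of the two labels.
labels <- getNodeSet(doc, "//span[text()='Date Posted' or text()='Date Updated']")

# For each label: go up to the parent, take its second span child,
# and trim leading/trailing whitespace from that span's text.
cleaned <- sapply(labels, function(node) {
  spans <- getNodeSet(xmlParent(node), "./span")
  gsub("^\\s+|\\s+$", "", xmlValue(spans[[2]]))
})
names(cleaned) <- sapply(labels, xmlValue)

# Optional: convert to a date-time class. "%B" matches full month names
# in the current locale, so this assumes English month names.
dates <- lapply(cleaned, strptime, format = "%B %d, %Y", tz = "UTC")
```

Here cleaned holds the trimmed date strings keyed by label, and dates holds the same values as POSIXlt objects. If you only need the text, you can stop before the strptime step.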