I'm all new to scraping and I'm trying to understand xpath using R

Asked: May 23, 2026 (2026-05-23T15:02:05+00:00)

I'm completely new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:

library(XML)   # htmlTreeParse, xmlValue
library(plyr)  # ldply
r <- htmlTreeParse(e)  # e is the page source returned by getURL()
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]  # keep only the text nodes
u <- ldply(l, xmlValue)

However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I somehow submit my path in the index form shown above?

So far I've come up with:

kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//body//text//div//div//p//text()", xmlValue)

But this leaves me with a lot of cleaning up to do, and I assume it can be done better.

Regards,
//M

EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. Unfortunately the XML document is too large to paste here. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source. I've come a little closer to what I'd like:

e2 <- getNodeSet(htmlTreeParse(e, useInternalNodes = TRUE), "//p")[[5]]

This gives me the list of what I want, but it is still XML with br tags. I thought running

kk <- xpathApply(e2, "//text()", xmlValue)

would provide a list that could later be unlisted. However, it returns a list with more garbage than e2 displays.

Is there a way to do this directly, like this?

kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//p[5]//text()", xmlValue)

Link to the web page (I'm trying to get the names, and only the names, from the page):

getURL("http://legeforeningen.no/id/1712")


1 Answer

Editorial Team · Answer 1 · 2026-05-23T15:02:05+00:00

I ended up with

xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)

(no need for RCurl) and then

sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))

(the subsetting is done in the XPath expression), which leaves a final line that is not a name. One could do the text processing in XPath too, but then one would have to iterate at the R level:

n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
    xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})

Unfortunately, this does not pick up names that do not contain a comma.
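One possible way around the comma limitation is the usual XPath 1.0 idiom of appending a comma with concat() before calling substring-before(), so the function always finds one. The sketch below is not run against the real page: the `html` string and the names in it are made up for illustration, and `(//p)[2]` stands in for whichever paragraph holds the names. It also shows that when a node (rather than the document) is passed to xpathSApply, the path should start with `./`, because a path starting with `//` searches the whole document again, which is a likely source of the extra garbage mentioned in the question.

```r
library(XML)

# Hypothetical stand-in for the real page: one <p> whose text nodes are
# separated by <br> tags; the second entry has no comma.
html <- '<html><body><div>
  <p>Intro paragraph</p>
  <p>Anna Hansen, Oslo<br/>Per Olsen<br/>Kari Berg, Bergen</p>
</div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# (//p)[2] is the second <p> in the whole document.
p <- getNodeSet(doc, "(//p)[2]")[[1]]

# "./text()" is evaluated relative to p; "//text()" would start again
# from the document root and return every text node in the page.
raw <- xpathSApply(p, "./text()", xmlValue)

# R-level cleanup: sub() leaves strings without a comma unchanged.
names_r <- trimws(sub(",.*$", "", raw))

# XPath-level cleanup: concat() appends a comma so that
# substring-before() also works for entries that have none.
n <- xpathApply(doc, "count((//p)[2]/text())")
names_x <- sapply(seq_len(n), function(i) {
  xpathApply(doc, sprintf(
    'substring-before(concat((//p)[2]/text()[%d], ","), ",")', i))
})
```

Both `names_r` and `names_x` should then hold one entry per text node, with anything after the first comma dropped. On the real page the paragraph index and any trailing non-name line would still need to be checked by hand.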
