I’m using the XML package to scrape a list of websites. Specifically, I’m taking ratings from a list of candidates at the following site: votesmart.
The candidates’ pages are arranged in numerical order, from 1 upwards. My first attempt, to scrape the first 50 candidates, looks like this:
library(XML)
library(plyr)
url <- paste("http://www.votesmart.org/candidate/evaluations/", 1:50 , sep = "")
res <- llply(url, function(i) readHTMLTable(i))
But there are a couple of problems. For instance, the 25th page in this sequence generates a 404 "url not found" error. I’ve addressed this by first getting a data frame of the count of XML errors for each page in the sequence, and then excluding the pages which have a single error. Specifically:
errors <- ldply(url, function(i) length(getXMLErrors(i)))
url2 <- url[which(errors$V1 > 1)]
res2 <- llply(url2, function(i) readHTMLTable(i))
In this way, I’ve excluded the URLs that generate a 404 from the list.
However, there’s still a problem: numerous pages in the list cause the llply command to fail. The following is an example:
readHTMLTable("http://www.votesmart.org/candidate/evaluations/6")
which results in the error
Error in seq.default(length = max(numEls)) :
length must be non-negative number
In addition: Warning message:
In max(numEls) : no non-missing arguments to max; returning -Inf
However, these pages generate the same error count from getXMLErrors as the working pages, so I’m unable to distinguish between them on this front.
My question is: what does this error mean, and is there any way to get readHTMLTable to return an empty list for these pages, rather than an error? Failing that, is there a way I can get my llply statement to check these pages and skip those which result in an error?
Why not just some simple error handling?
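For example, here is a rough sketch of what that could look like: wrap the call in tryCatch so that a page which errors returns an empty list instead of stopping llply. The helper name skip_errors and the stand-in function flaky are made up for illustration and are not part of XML or plyr. The readHTMLTable line is left as a comment because it needs network access, and I haven't checked it against every page.

```r
# Hypothetical helper (not from XML or plyr): wraps any function so that
# an error returns `otherwise` instead of aborting the surrounding loop.
skip_errors <- function(f, otherwise = list()) {
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

# Intended use with the code from the question (requires network access):
# res <- llply(url, skip_errors(readHTMLTable))

# Offline demonstration with a stand-in function that fails on even inputs.
flaky <- function(i) if (i %% 2 == 0) stop("no tables found") else list(i)
demo <- lapply(1:4, skip_errors(flaky))
```

The pages that failed come back as zero-length lists, so afterwards you can drop them with something like res[sapply(res, length) > 0] if you only want the pages that actually had tables.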