I’m using the XML package to scrape a list of websites. Specifically, I’m taking


Asked: May 30, 2026


I’m using the XML package to scrape a list of websites. Specifically, I’m taking ratings from a list of candidates at the following site: votesmart.

The candidates’ pages are arranged in numerical order, from 1 upwards. My first attempt, to scrape the first 50 candidates, looks like this:

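library(XML)   # package names are case-sensitive
library(plyr)

# Build the URLs for candidates 1 to 50
url <- paste("http://www.votesmart.org/candidate/evaluations/", 1:50 , sep = "")
# Read every HTML table from each page
res <- llply(url, function(i) readHTMLTable(i))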

But there are a couple of problems. For instance, the 25th page in this sequence generates a 404 "url not found" error. I’ve addressed this by first getting a data frame of the count of XML errors for each page in the sequence, and then excluding the pages which have a single error. Specifically:

errors <- ldply(url, function(i) length(getXMLErrors(i)))
url2 <- url[which(errors$V1 > 1)]
res2 <- llply(url2, function(i) readHTMLTable(i))

In this way, I’ve excluded the 404-generating URLs from the list.

However, there’s still a problem: numerous pages in the list cause the llply command to fail. The following is an example:

readHTMLTable("http://www.votesmart.org/candidate/evaluations/6")

which results in the error:

Error in seq.default(length = max(numEls)) : 
  length must be non-negative number
In addition: Warning message:
In max(numEls) : no non-missing arguments to max; returning -Inf

However, these pages generate the same error count from the getXMLErrors command as the working pages, so I’m unable to distinguish between them on this front.

My question is: what does this error mean, and is there any way to get readHTMLTable to return an empty list for these pages, rather than an error? Failing that, is there a way I can get my llply statement to check these pages and skip those which result in an error?


1 Answer

Editorial Team · Answer 1 · 2026-05-30T04:49:59+00:00


Why not just some simple error handling?

res <- llply(url, function(i) try(readHTMLTable(i)))
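If an empty list is preferred over an error object, the same idea can be sketched with tryCatch. This is only an illustration under stated assumptions: safe_read and fake_reader are hypothetical names that do not appear in the question, fake_reader is a stand-in for readHTMLTable so the sketch runs without network access, and base lapply is used in place of llply to keep it self-contained.

```r
# Hypothetical wrapper: call `reader` on a URL and return an empty list
# instead of stopping when the call raises an error.
safe_read <- function(u, reader) {
  tryCatch(reader(u), error = function(e) list())
}

# Stand-in for readHTMLTable (an assumption, so the sketch runs offline):
# it fails on even-numbered pages and returns one table otherwise.
fake_reader <- function(u) {
  id <- as.integer(sub(".*/", "", u))
  if (id %% 2 == 0) stop("length must be non-negative number")
  list(data.frame(id = id))
}

url <- paste("http://www.votesmart.org/candidate/evaluations/", 1:4, sep = "")

# Every page now yields a list; failing pages yield an empty one.
res <- lapply(url, safe_read, reader = fake_reader)

# Keep only the pages that produced at least one table.
ok <- vapply(res, length, integer(1)) > 0
res_ok <- res[ok]
```

With the real packages loaded, passing reader = readHTMLTable should behave the same way. If you stay with plain try(), the failed pages come back as objects of class "try-error", which can be dropped afterwards with Filter(function(x) !inherits(x, "try-error"), res).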
