I am gathering data about different universities and I have a question about the

Question

Asked: June 5, 2026 (2026-06-05T05:15:22+00:00)

I am gathering data about different universities, and I have a question about the following error, which appears after executing the code below. The problem occurs when calling htmlParse().

Code:

library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse()

url1 <- "http://nces.ed.gov/collegenavigator/?id=165015"
webpage1 <- getURL(url1)
doc1 <- htmlParse(webpage1)

Output:

Error in htmlParse(webpage1) : File

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><head id="ctl00_hd"><meta http-equiv="Content-type" content="text/html;charset=UTF-8" /><title>

    College Navigator - National Center for Education Statistics

</title><link href="css/md0.css" type="text/css" rel="stylesheet"> <meta name="keywords" content="college navigator,college search,postsecondary education,postsecondary statistics,NCES,IPEDS,college locator"></meta> <meta name="description" content="College Navigator is a free consumer information tool designed to help students, parents, high school counselors, and others get information about over 7,000 postsecondary institutions in the United States – such as programs offered, retention and graduation rates, prices, aid available, degrees awarded, campus safety, and accreditation."></meta> <meta name="robots" content="index,nofollow"></meta><link

I have scraped web pages with this package before and never had an issue. Does the name="robots" tag have anything to do with it? Any help would be greatly appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-05T05:15:24+00:00

The W3C validator report at
http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fnces.ed.gov%2Fcollegenavigator%2F%3Fid%3D165015
indicates that the webpage is badly formed. Your browser can compensate for this, but your R package is having problems.

If you are using Windows, you can get Internet Explorer to fix the markup for you as follows:

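library(rcom)  # COM automation; Windows only
library(XML)
# Start Internet Explorer through COM.
ie = comCreateObject('InternetExplorer.Application')
ie[["visible"]] = TRUE  # show the window; useful for debugging
comInvoke(ie, "Navigate2", "http://nces.ed.gov/collegenavigator/?id=165015")
# Wait until the page has finished loading (ReadyState 4 means complete).
while (comGetProperty(ie, "busy") || comGetProperty(ie, "ReadyState") < 4) {
  Sys.sleep(1)
  print(comGetProperty(ie, "ReadyState"))
}
# Take the HTML as repaired by the browser and parse that instead.
myDoc <- comGetProperty(ie, "Document")
webpage1 <- myDoc$getElementsByTagName('html')[[0]][['innerHTML']]
ie$Quit()
doc1 <- htmlParse(webpage1)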
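
A possible alternative, offered as a sketch and not as a confirmed fix for this page: the error message begins with "File" followed by the page source, which suggests htmlParse() may be treating the downloaded string as a file name. htmlParse() in the XML package accepts an asText argument that marks the input as document content. The snippet below uses a small made-up HTML string (sample_html is a hypothetical stand-in for the getURL() result) so that it runs without network access.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL(); not the real page.
sample_html <- paste0(
  "<html><head><title>College Navigator</title>",
  "<meta name=\"robots\" content=\"index,nofollow\"></meta></head>",
  "<body><p>Example body</p></body></html>"
)

# asText = TRUE tells the parser that the argument is HTML content,
# not a file name or URL.
doc <- htmlParse(sample_html, asText = TRUE)

# Pull the page title out of the parsed tree with XPath.
page_title <- xpathSApply(doc, "//title", xmlValue)
print(page_title)
```

With the real page, the equivalent call would be htmlParse(webpage1, asText = TRUE). Whether that alone is enough for this particular page has not been tested here.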
