I’m using Python 3.2.3’s urllib.request module to download Google search results, but urlopen behaves oddly: it works with links to Google search results, but not with links to Google Scholar. In this example, I’m searching for "JOHN SMITH". This code successfully prints HTML:
from urllib.request import urlopen, Request
from urllib.error import URLError
# Google
try:
    page_google = '''http://www.google.com/#hl=en&sclient=psy-ab&q=%22JOHN+SMITH%22&oq=%22JOHN+SMITH%22&gs_l=hp.3..0l4.129.2348.0.2492.12.10.0.0.0.0.154.890.6j3.9.0...0.0...1c.gjDBcVcGXaw&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=dffb3b4a4179ca7c&biw=1366&bih=649'''
    req_google = Request(page_google)
    # The HTTP header name is hyphenated: 'User-Agent', not 'User Agent'
    req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_google = urlopen(req_google).read()
    print(html_google[0:10])
except URLError as e:
    print(e)
However, this code, which does the same for Google Scholar, raises a URLError exception:
from urllib.request import urlopen, Request
from urllib.error import URLError
# Google Scholar
try:
    page_scholar = '''http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'''
    req_scholar = Request(page_scholar)
    # The HTTP header name is hyphenated: 'User-Agent', not 'User Agent'
    req_scholar.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_scholar = urlopen(req_scholar).read()
    print(html_scholar[0:10])
except URLError as e:
    print(e)
Traceback:
Traceback (most recent call last):
  File "/home/ak5791/Desktop/code-sandbox/scholar/crawler.py", line 6, in <module>
    html = urlopen(page).read()
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1155, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -5] No address associated with hostname>
I obtained these links by searching in Chrome and copying the link from there. One commenter reported a 403 error, which I sometimes get as well. I presume this is because Google doesn’t support scraping of Scholar. However, changing the User Agent string doesn’t fix this or the original problem, since I get URLErrors most of the time.
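To narrow down which failure is actually occurring, a minimal sketch like the following could separate the occasional 403 responses from the name-resolution error shown in the traceback. The helper names describe_failure and fetch are hypothetical and not part of urllib. The sketch assumes the [Errno -5] error reaches URLError.reason as a socket.gaierror, which is how urllib normally reports a failed hostname lookup.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def describe_failure(err):
    """Return a short label for an exception raised by urlopen (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(err, HTTPError):
        return "http status {}".format(err.code)
    if isinstance(err, URLError):
        # For connection-level problems, .reason usually holds the underlying
        # socket error; socket.gaierror means the hostname lookup failed.
        if isinstance(err.reason, socket.gaierror):
            return "dns failure"
        return "connection failure"
    return "other"


def fetch(url, user_agent):
    """Fetch url and return (body, failure_label); exactly one of the two is None."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        return urlopen(req).read(), None
    except URLError as err:
        return None, describe_failure(err)
```

If fetch reports "dns failure" for the Scholar URL, the request never reached Google at all, so the User-Agent string could not have made a difference in that case. An "http status 403" label would instead mean the server received the request and refused it.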
This PHP script seems to indicate that you’ll need to set some cookies before Google gives you results.
This is corroborated by a comment on a Python recipe for Google Scholar, which warns that Google detects scripts and will disable your access if you use them too heavily.
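In Python, the cookie handling could be sketched roughly as follows using the standard library's http.cookiejar. The function names make_cookie_opener and fetch_after_landing are made up for illustration. Whether Scholar actually needs a prior visit to set cookies is an assumption taken from the PHP script, not something verified here.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_cookie_opener(user_agent):
    """Build an opener that stores cookies and sends them back on later requests."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # addheaders replaces the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_after_landing(opener, landing_url, search_url):
    """Visit landing_url first so any cookies it sets are sent with search_url."""
    opener.open(landing_url).read()
    return opener.open(search_url).read()
```

A call might look like fetch_after_landing(opener, 'http://scholar.google.com/', page_scholar), with opener taken from make_cookie_opener. This only helps once the hostname resolves, and given the warning about script detection, it makes sense to keep the request rate low.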