I’m trying to parse the links from Google search results, but I end up with weird output.
import mechanize, re, lxml.html
from lxml.html import parse
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.set_handle_robots(False)
url = 'https://www.google.com/search?q=test&gl=US'
response = br.open(url)
html = response.read().lower()
doc = lxml.html.document_fromstring(html)
for t in doc.xpath("//h3[@class='r']/a"):
    print t.get('href')
which results in the following output:
Any help would be great,
Thanks
It’s not entirely clear what you’re trying to achieve, because you’re getting exactly what you asked for.
You’re getting the href attribute of the inner <a> tag, which comes out to:
But more likely you’re looking for the link text and the link URL. The URL that you’ll be sent to, without Google’s extra URL wrapping, is in the <cite> element, and the link text is in the <a> element you’ve already found.
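As a rough sketch of that idea, the snippet below pairs each link's text with the text of the matching cite element. It runs on a small hand-written sample rather than a live page, so the SAMPLE markup, the li class="g" container, the /url?q=... href values, and the extract_results helper are all assumptions made for illustration; the real result markup is larger and may well differ. It uses only findall, find and itertext, which lxml elements also provide, so the doc object from the question could probably be passed in place of the parsed sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a results page. The container
# element (li class="g") and the href values are assumptions, not
# something taken from the original question.
SAMPLE = """
<div>
  <li class="g">
    <h3 class="r"><a href="/url?q=http://example.com/&amp;sa=U">Example <b>Test</b> Page</a></h3>
    <cite>example.com/</cite>
  </li>
  <li class="g">
    <h3 class="r"><a href="/url?q=http://example.org/test&amp;sa=U">Another Test</a></h3>
    <cite>example.org/<b>test</b></cite>
  </li>
</div>
"""


def extract_results(root):
    """Return (link text, cite text) pairs, one per result container."""
    results = []
    for item in root.findall(".//li[@class='g']"):
        link = item.find(".//h3[@class='r']/a")
        cite = item.find(".//cite")
        if link is None or cite is None:
            continue  # skip containers missing either element
        # itertext() also collects text nested in child tags such as <b>
        title = "".join(link.itertext()).strip()
        shown_url = "".join(cite.itertext()).strip()
        results.append((title, shown_url))
    return results


results = extract_results(ET.fromstring(SAMPLE))
for title, shown_url in results:
    print("%s -> %s" % (title, shown_url))
```

Joining itertext() rather than reading .text matters here because the title and the cite text are often split across nested tags such as <b>, and .text alone would stop at the first child element.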