I’m trying to open multiple pages following a certain format using mechanize. I want to start with a certain page, and have mechanize follow all the links that have a certain class or piece of text in a link. For example, the root url would be something like
http://hansard.millbanksystems.com/offices/prime-minister
and I want to follow every link on the page that has a format such as
<li class='office-holder'><a href="http://hansard.millbanksystems.com/people/mr-tony-blair">Mr Tony Blair</a> May 2, 1997 - June 27, 2007</li>
In other words, I want to follow every link that has the class ‘office-holder’ or that has /people/ in the URL. I’ve tried the following code, but it hasn’t worked.
import mechanize
br = mechanize.Browser()
response = br.open("http://hansard.millbanksystems.com/offices/prime-minister")
links = br.links(url_regex="/people/")
print links
I’m trying to print the links so I can make sure that I’m getting the right links/information before writing any more code. The error(?) I get from this is:
<generator object _filter_links at 0x10121e6e0>
Any pointers or tips are appreciated.
That’s not an error. It means that Browser.links() returns a generator object rather than a list.
An iterator is an object that acts “like a list”, meaning that you can loop over it with a for statement, pass it to list(), and so on. But you can only access things in whatever order it defines; you can’t necessarily do links[5], and once you’ve gone through the iterator, it’s used up.
A generator is, for most purposes, just an iterator that doesn’t necessarily know all its results in advance. This is very useful in generator expressions, and you can actually write very simple functions that return generators with the yield keyword, such as an odds() function that yields every odd number and never finishes.
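As a minimal sketch, here is one way the odds() generator referred to in this answer might look. Its exact definition is an assumption (the original example is not shown here), and the code is written for Python 3, where print is a function.

```python
from itertools import islice


def odds():
    """Yield every positive odd number, forever (assumed definition)."""
    n = 1
    while True:
        yield n
        n += 2


# Take only the first five values; the rest are never computed.
first_five = list(islice(odds(), 5))
print(first_five)  # prints [1, 3, 5, 7, 9]
```

Calling list(odds()) directly would never return, which is why islice is used to cut the stream off.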
This is a Good Thing because it means that you don’t have to store all of your data in memory at once (which for odds() would be impossible…), and if you only need the first few elements of the result you don’t have to bother computing the rest. The itertools module has a bunch of handy functions for dealing with iterators.
Anyway, if you just want to print out the contents of links, you can turn it into a list with the list() function (which takes an iterable and returns a list of its elements) and print that, make a list of strings with a list comprehension, or walk over its elements with a for loop and print them out one by one.
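The three options above can be sketched as follows. To keep the example runnable without network access, fake_links() and the Link namedtuple are hypothetical stand-ins for br.links(url_regex="/people/") and mechanize’s link objects, and the second URL is a made-up placeholder. The code is written for Python 3.

```python
from collections import namedtuple

# Hypothetical stand-in for mechanize's link objects (only url and text are modelled).
Link = namedtuple("Link", ["url", "text"])


def fake_links():
    """Stand-in for br.links(url_regex="/people/"): yields links lazily."""
    yield Link("http://hansard.millbanksystems.com/people/mr-tony-blair", "Mr Tony Blair")
    yield Link("http://example.invalid/people/placeholder", "Placeholder")  # made-up URL


# Option 1: turn the generator into a list and print it.
as_list = list(fake_links())
print(as_list)

# Option 2: build a list of strings with a list comprehension.
urls = [link.url for link in fake_links()]
print(urls)

# Option 3: walk over the elements and print them one at a time.
for link in fake_links():
    print(link.url, link.text)
```

Each option calls fake_links() again because a generator can only be consumed once.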
But note that after you do this, links will be “exhausted”, so if you want to actually do anything with it, you’ll need to get it again.
Maybe the simplest option is to immediately turn it into a list and not worry about it being an iterator at all, by passing the result of br.links(url_regex="/people/") straight to list().
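A small sketch of the exhaustion problem and of the convert-it-immediately approach. Here fake_links() is a hypothetical stand-in for the br.links(...) call so the example runs offline, and the second URL is a made-up placeholder.

```python
def fake_links():
    """Hypothetical stand-in for br.links(...): a generator of URL strings."""
    yield "http://hansard.millbanksystems.com/people/mr-tony-blair"
    yield "http://example.invalid/people/placeholder"  # made-up URL


lazy = fake_links()
first_pass = list(lazy)   # consumes the generator
second_pass = list(lazy)  # nothing left: the generator is exhausted

# Converting right away gives an ordinary list that can be reused freely.
links = list(fake_links())
count = len(links)              # lists support len() and indexing
first = links[0]
again = [url for url in links]  # and can be iterated any number of times
```

The cost is that every element is held in memory at once, which is unlikely to matter for the links on a single page.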
Also, you’re obviously not yet getting links that have the class you want. There might be some mechanize trick to do an “or” here, but a nifty way to do it using sets and generator expressions would be to build one set of URLs from the /people/ links and another from the links with the class, then take the union of the two.
For the second set you need some helper, call it get_links_with_class, which is only a placeholder here; replace it with the real way to get those links. Then you’ll end up with a set of all the link URLs that have /people/ in their URL and/or have the class office-holder, with no duplicates. (Note that you can’t put the Link objects in the set directly because they’re not hashable.)
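One possible sketch of that union, runnable offline. Everything here is an assumption made for illustration: get_links_with_class is implemented with the standard library’s html.parser rather than with anything from mechanize, it assumes the class sits on a non-nested li around the link (as in the question’s snippet), and the sample page and its example.invalid URLs are invented. With a real browser object, the first set would instead be built from br.links(url_regex="/people/").

```python
from html.parser import HTMLParser


class ClassLinkParser(HTMLParser):
    """Collect the href of every <a> inside an <li> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.inside = False  # True while inside a matching <li>
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            self.inside = self.css_class in classes
        elif tag == "a" and self.inside and attrs.get("href"):
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li":
            self.inside = False


def get_links_with_class(html, css_class):
    """Hypothetical helper: URLs of links inside <li class=css_class>."""
    parser = ClassLinkParser(css_class)
    parser.feed(html)
    return parser.urls


# Invented sample page; only the first link comes from the question.
html = """
<ul>
  <li class='office-holder'><a href="http://hansard.millbanksystems.com/people/mr-tony-blair">Mr Tony Blair</a></li>
  <li class='office-holder'><a href="http://example.invalid/holders/placeholder">Placeholder</a></li>
  <li><a href="http://example.invalid/people/other">Other</a></li>
  <li><a href="http://example.invalid/about">About</a></li>
</ul>
"""

# Stand-in for the URLs of every link on the page.
all_urls = [
    "http://hansard.millbanksystems.com/people/mr-tony-blair",
    "http://example.invalid/holders/placeholder",
    "http://example.invalid/people/other",
    "http://example.invalid/about",
]

# Union of two sets built from generator expressions; duplicates collapse.
wanted = set(url for url in all_urls if "/people/" in url) | set(
    url for url in get_links_with_class(html, "office-holder")
)
print(sorted(wanted))
```

The Tony Blair URL matches both conditions but appears only once in the result, because the sets hold plain URL strings.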