I am learning Python and Beautiful Soup by trying to scrape data. I have an HTML page with this format:
span id listing-name-1
span class address
span preferredcontact="1"
a ID websiteLink1
span id listing-name-2
span class address
span preferredcontact="2"
a ID websiteLink2
span id listing-name-3
span class address
span preferredcontact="3"
a ID websiteLink3
and so on up to 40 such entries.
I would like to get the text inside those classes/IDs, in the same order as they appear on the HTML page.
To get started, I tried something like this to get listing-name-1:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12")
soup = BeautifulSoup(page)
soup.find(span,attrs={"id=listing-name-1"})
It throws an "An existing connection was forcibly closed by the remote host" error.
I have no idea how to fix this. I need help on two things:
- How to fix that error
- How can I iterate over listing-name-1 through listing-name-40? I do not want to type soup.find(span,attrs={"id=listing-name-1"}) for all 40 span IDs.
Thank you!
With `lxml.html` you can call `parse` directly with a URL, so you don't have to call `urllib` yourself. Also, instead of using `find` or `findall`, you'll want to call `xpath` so you get the full expressiveness of XPath; if you tried passing such an expression to `find`, it would fail with an `invalid predicate` error. The `xpath` call returns its matches in the same order as they appear on the page.
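A minimal sketch of that kind of query, not the original snippet from this answer. It assumes the listing names live in `span` elements whose `id` starts with `listing-name-`, as described in the question. The sample markup, the listing names in it, and the helper name `listing_names` are hypothetical, and the page is parsed from a string so the example runs without network access.

```python
import lxml.html

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<html><body>
  <span id="listing-name-1">Alpha Architects</span>
  <span class="address">1 First St</span>
  <span id="listing-name-2">Beta Design</span>
  <span class="address">2 Second St</span>
  <span id="listing-name-3">Gamma Studio</span>
  <span class="address">3 Third St</span>
</body></html>
"""


def listing_names(tree):
    """Return the text of every span whose id starts with 'listing-name-'.

    One XPath query matches all numbered ids, so there is no need to
    spell out listing-name-1 through listing-name-40 by hand.
    """
    return tree.xpath("//span[starts-with(@id, 'listing-name-')]/text()")


tree = lxml.html.fromstring(SAMPLE_HTML)
names = listing_names(tree)
print(names)
```

For a live page, the tree would come from `lxml.html.parse(url)` rather than from a string, and the same `xpath` call can be made on the parsed tree.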
To answer the question in your comments to my answer: what you want to search for is the `<div class="listingInfoContainer">...</div>`, which contains all the info that you want (the name, address, etc.). You can then loop over the list of div elements that match and use XPath expressions to extract the rest of the information. Note that in this case I use `container.xpath('.//span')`, which searches from the current node (the container div). If you leave out the `.` and just have `//span`, the search starts from the top of the tree and you get a list of all the matching elements in the document, which is not what you want once you have selected the container node.
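The container loop can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the class names `listingInfoContainer` and `address` come from the posts above, but the sample markup, the inner layout of each container, and the function name `extract_listings` are hypothetical.

```python
import lxml.html

# Hypothetical sample markup; the real page layout may differ.
SAMPLE_HTML = """
<html><body>
  <div class="listingInfoContainer">
    <span id="listing-name-1">Alpha Architects</span>
    <span class="address">1 First St<br>Sydney NSW</span>
  </div>
  <div class="listingInfoContainer">
    <span id="listing-name-2">Beta Design</span>
    <span class="address">2 Second St</span>
  </div>
</body></html>
"""


def extract_listings(tree):
    """Build one dict per container div, in document order."""
    listings = []
    for container in tree.xpath("//div[@class='listingInfoContainer']"):
        listings.append({
            # The leading '.' keeps each search inside this container.
            "name": container.xpath(
                ".//span[starts-with(@id, 'listing-name-')]/text()"),
            "address": container.xpath(".//span[@class='address']/text()"),
        })
    return listings


tree = lxml.html.fromstring(SAMPLE_HTML)
listings = extract_listings(tree)
for entry in listings:
    print(entry)
```

In this sample, the first address comes back as two list items because the `<br>` splits the span's text into two text nodes, while the second address is a single item.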
The output is a list (again, preserving order the way you wanted) of dictionaries with the keys `name` and `address`, each of which contains a list. That final list is returned by `text()`, which preserves the `\n` newline characters in the original HTML and translates things like `<br>` into a new list element. An example of why it does that is the list item Archicentre, where the original HTML representation is: