I am learning Python – Beautiful Soup by trying to scrape data. I have

Question

Asked: May 28, 2026 (2026-05-28T14:41:26+00:00)

I am learning Python and Beautiful Soup by trying to scrape data. I have an HTML page with this format:

span id listing-name-1
span class address
span preferredcontact="1"
a ID websiteLink1

span id listing-name-2
span class address
span preferredcontact="2"
a ID websiteLink2

span id listing-name-3
span class address
span preferredcontact="3"
a ID websiteLink3

and so on, up to 40 such entries.

I would like to get the text inside those classes/IDs in the same order as they appear on the HTML page.

To get started, I tried something like this to fetch listing-name-1:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12")

soup = BeautifulSoup(page)

soup.find(span,attrs={"id=listing-name-1"})

It throws an "An existing connection was forcibly closed by the remote host" error.

I have no idea how to fix this. I need help with two things:

How do I fix that error?
How can I iterate over listing-name-1 through listing-name-40? I do not want to type soup.find(span,attrs={"id=listing-name-1"}) for all 40 span IDs.

Thank you!

1 Answer

Editorial Team · Answer 1 · 2026-05-28T14:41:27+00:00

With lxml.html you can call parse directly with a URL, so you don't have to call urllib yourself. Also, instead of find or findall, call xpath so that you get the full expressiveness of XPath; if you pass the expression below to find, it raises an invalid predicate error.

#!/usr/bin/env python

import lxml.html

url = "http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12"
tree = lxml.html.parse(url)
listings = tree.xpath("//span[contains(@id,'listing-name-')]/text()")
print(listings)

This outputs the following, preserving the order:

['Cape Cod Australia Pty Ltd',
'BHI',
'Fibrent Pty Ltd Building & Engineering Assessments',
 ...
'Archicentre']
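The invalid predicate error mentioned above can be reproduced without a network connection. The sketch below is an illustration, not part of the original answer: it uses the standard library's xml.etree.ElementTree, whose find/findall accept only a limited XPath subset (ElementPath), the same subset that lxml's find/findall follow. The SAMPLE markup is a hypothetical, well-formed stand-in for the listings page; real-world HTML is usually not valid XML, so this parser is only suitable for the demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listings page (an assumption).
SAMPLE = """
<div>
  <span id="listing-name-1">Cape Cod Australia Pty Ltd</span>
  <span id="listing-name-2">BHI</span>
  <span class="address">Suite 5, 65 Doody St, Alexandria NSW 2015</span>
</div>
"""

root = ET.fromstring(SAMPLE)

# findall() only understands the ElementPath subset, which has no
# contains() function, so this predicate is rejected with a SyntaxError.
try:
    root.findall(".//span[contains(@id,'listing-name-')]")
    find_error = None
except SyntaxError as exc:
    find_error = str(exc)

# Without XPath functions, the same selection can be done by filtering
# in Python; iter() walks the tree in document order.
names = [
    span.text
    for span in root.iter("span")
    if span.get("id", "").startswith("listing-name-")
]

print(find_error)
print(names)  # ['Cape Cod Australia Pty Ltd', 'BHI']
```

With lxml, the xpath call shown earlier avoids the manual filter, because it supports XPath functions such as contains().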

To answer the question in your comments on my answer: what you want to search for is the <div class="listingInfoContainer">...</div>, which contains all the info you want (the name, address, etc.). You can then loop over the list of div elements that match and use XPath expressions to extract the rest of the information. Note that in this case I use expressions that start with .//span, such as c.xpath('.//span[@class="address"]/text()'), which search from the current node (the container div). If you leave out the . and just write //span, the search starts from the top of the tree and returns every matching element in the document, which is not what you want once you have selected the container node.

#!/usr/bin/env python

import lxml.html

url = "http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12"
tree = lxml.html.parse(url)
container = tree.xpath("//div[@class='listingInfoContainer']")
listings = []
for c in container:
    data = {}
    data['name'] = c.xpath('.//span[contains(@id,"listing")]/text()')
    data['address'] = c.xpath('.//span[@class="address"]/text()')
    listings.append(data)

print(listings)

which outputs:

[{'name': ['Cape Cod Australia Pty Ltd'], 
  'address': ['4th Floor 410 Church St, North Parramatta NSW 2151']}, 
 {'name': ['BHI'], 
  'address': ['Suite 5, 65 Doody St, Alexandria NSW 2015']}, 
 {'name': ['Fibrent Pty Ltd Building & Engineering Assessments'], 
  'address': ["Suite 3B, Level 1, 72 O'Riordan St, Alexandria NSW 2015"]}, 
  ...
 {'name': ['Archicentre'], 
  'address': ['\n                                         Level 3, 60 Collins St\n                                         ',
              '\n                                         Melbourne VIC 3000\n                                    ']}]

The result is a list of dictionaries (again, preserving order the way you wanted) with the keys name and address, each of which holds a list. That inner list is what text() returns: it preserves the \n newline characters from the original HTML and yields a separate list element for each run of text on either side of a tag such as <br>. The last item, Archicentre, shows why; its original HTML is:

<span class="address">
     Level 3, 60 Collins St
     <br/>
     Melbourne VIC 3000
</span>
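If you would rather have one clean string per field than a list of raw fragments, a small helper can collapse the whitespace and join the pieces. This is a minimal sketch and not part of the original answer; the name join_text_nodes and the ", " separator are my own assumptions.

```python
def join_text_nodes(parts, separator=", "):
    """Collapse text fragments from an XPath text() query into one string.

    Hypothetical helper: the name and the default separator are assumptions.
    """
    # Squash runs of whitespace (including \n) inside each fragment.
    cleaned = (" ".join(part.split()) for part in parts)
    # Drop fragments that were whitespace only, then join the rest.
    return separator.join(piece for piece in cleaned if piece)


# Fragments modelled on the Archicentre entry above (indentation shortened).
archicentre = [
    "\n        Level 3, 60 Collins St\n        ",
    "\n        Melbourne VIC 3000\n    ",
]

print(join_text_nodes(archicentre))
# Level 3, 60 Collins St, Melbourne VIC 3000
```

Inside the loop from the answer, this could be applied as data['address'] = join_text_nodes(c.xpath('.//span[@class="address"]/text()')), which also leaves single-fragment addresses unchanged apart from trimming.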
