I am trying to learn about web scraping and python (and programming for that

Asked: June 14, 2026

I am trying to learn about web scraping and python (and programming for that matter) and have found the BeautifulSoup library which seems to offer a lot of possibilities.

I am trying to find out how to best pull the pertinent information from this page:

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

I can go into more detail on this, but basically I want the company name, its description, the contact details, and the various company details and statistics.

At this stage I am looking at how to cleanly isolate and scrape this data, with a view to putting it all in a CSV or something similar later.

I am confused about how to use BS to grab the different table data. There are lots of tr and td tags, and I am not sure how to anchor on to anything unique.

The best I have come up with is the following code as a start:

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html)
soupie = soup.prettify()
print soupie

and then, from there, use regex etc. to pull data from the cleaned-up text.

But there must be a better way to do this using the BS tree? Or is this site formatted in a way that BS won’t provide much more help?

Not looking for a full solution as that is a big ask and I want to learn, but any code snippets to get me on my way would be much appreciated.

Update

Thanks to @ZeroPiraeus below I am starting to understand how to parse through the tables. Here is the output from his code:

=== Personnel ===
bodytext    Ms Gail Morgan CEO
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
bodytext    Lisa Mayoh Sales Manager
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: bob@aerospacematerials.com.au

=== Company Details ===
bodytext    ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: lisa@aerospacematerials.com.au Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
paraheading ACN:
bodytext    007 350 807
paraheading ABN:
bodytext    71 007 350 807
paraheading 
bodytext    Australian Owned
paraheading Annual Turnover:
bodytext    $5M - $10M
paraheading Number of Employees:
bodytext    6-10
paraheading QA:
bodytext    ISO9001-2008, AS9120B,
paraheading Export Percentage:
bodytext    5 %
paraheading Industry Categories:
bodytext    AerospaceLand (Vehicles, etc)LogisticsMarineProcurement
paraheading Company Email:
bodytext    lisa@aerospacematerials.com.au
paraheading Company Website:
bodytext    http://www.aerospacematerials.com.au
paraheading Office:
bodytext    2/6 Ovata Drive Tullamarine VIC 3043
paraheading Post:
bodytext    PO Box 188 TullamarineVIC 3043
paraheading Phone:
bodytext    +61.3. 9464 4455
paraheading Fax:
bodytext    +61.3. 9464 4422

My next question is: what is the best way to put this data into a CSV suitable for importing into a spreadsheet? For example, having things like ‘ABN’, ‘ACN’, ‘Company Website’ etc. as column headings, with the corresponding data in the rows.

Thanks for any help.

1 Answer

Editorial Team · Answer 1 · 2026-06-14T05:57:21+00:00

Your code will depend on exactly what you want and how you want to store it, but this snippet should give you an idea of how to get the relevant information out of the page:

import requests

from bs4 import BeautifulSoup

url = "http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113"
html = requests.get(url).text
soup = BeautifulSoup(html)

for feature_heading in soup.find_all("td", {"class": "Feature-Heading"}):
    print("\n=== %s ===" % feature_heading.text)
    details = feature_heading.find_next_sibling("td")
    for item in details.find_all("td", {"class": ["bodytext", "paraheading"]}):
        print("\t".join([item["class"][0], " ".join(item.text.split())]))

I find requests a more pleasant library to work with than urllib2, but of course that’s up to you.
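To see the sibling-walking idea in isolation, here is a minimal offline sketch. The HTML string is a made-up fragment that only imitates the class names used above (Feature-Heading, paraheading, bodytext); the real page's markup may well differ. The helper name pair_up is hypothetical, and the explicit "html.parser" argument is my assumption rather than something the snippet above relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page's class names; not the real markup.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="Feature-Heading">Company Details</td>
    <td>
      <table>
        <tr><td class="paraheading">ACN:</td><td class="bodytext">007 350 807</td></tr>
        <tr><td class="paraheading">ABN:</td><td class="bodytext">71 007  350 807</td></tr>
      </table>
    </td>
  </tr>
</table>
"""


def pair_up(html):
    """Return {section: {label: value}} built from heading/value cell pairs."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for feature in soup.find_all("td", class_="Feature-Heading"):
        # The details live in the <td> that follows the section heading cell.
        details = feature.find_next_sibling("td")
        if details is None:
            continue
        pairs = {}
        for label in details.find_all("td", class_="paraheading"):
            value = label.find_next_sibling("td", class_="bodytext")
            if value is None:
                continue  # label cell with no value cell beside it
            # Collapse runs of whitespace and drop the trailing colon.
            key = " ".join(label.get_text().split()).rstrip(":")
            pairs[key] = " ".join(value.get_text().split())
        sections[" ".join(feature.get_text().split())] = pairs
    return sections


result = pair_up(SAMPLE_HTML)
print(result)
```

Collecting label/value pairs into a dict like this, rather than printing them, tends to make a later export step simpler, because each label becomes a natural column name.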

EDIT:

In response to your followup question, here’s something you could use to write a CSV file from the scraped data:

import csv
import requests

from bs4 import BeautifulSoup

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
urls = ["http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113", ] # ... etc.

with open("data.csv", "w") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text)
        row = {}
        for heading in soup.find_all("td", {"class": "paraheading"}):
            key = " ".join(heading.text.split()).rstrip(":")
            if key in columns:
                next_td = heading.find_next_sibling("td", {"class": "bodytext"})
                if next_td is None:
                    continue  # heading cell without a matching value cell
                row[key] = " ".join(next_td.text.split())
        writer.writerow(row)
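One detail that may matter once you scrape more than one company: some pages might lack a column you asked for, or you might end up with extra keys in a row. The sketch below uses only the standard library and assumes Python 3. The first row reuses values from the output shown in the question, while the second row is invented purely to show a missing field and an unexpected one. restval and extrasaction are real csv.DictWriter parameters; writing to io.StringIO instead of a file is just to keep the example self-contained.

```python
import csv
import io

columns = ["ACN", "ABN", "Annual Turnover", "QA"]

# Stand-in rows. The first reuses values from the output in the question;
# the second is invented to show missing and extra fields.
rows = [
    {"ACN": "007 350 807", "ABN": "71 007 350 807",
     "Annual Turnover": "$5M - $10M", "QA": "ISO9001-2008, AS9120B,"},
    {"ABN": "00 000 000 000", "Fax": "dropped, because Fax is not a column"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    columns,
    restval="",             # written for any column missing from a row
    extrasaction="ignore",  # silently skip keys that are not columns
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the QA value contains a comma, so the writer quotes that field automatically; a spreadsheet import should then still treat it as a single cell.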
