With a great deal of help from the stackoverflow community I am learning a

Question 1

With a great deal of help from the Stack Overflow community I am learning a lot about Python, and specifically about scraping with BeautifulSoup. I am referring again to the same example page I have been using to learn my way through this.

I have the following code:

from bs4 import BeautifulSoup
import re

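# columns (used below) is assumed to be defined elsewhere as the
# collection of heading names to print.
with open('webpage.txt', 'r') as f:
    g = f.read()
soup = BeautifulSoup(g)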

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        print ic_next_td

which from this page:

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

saved as webpage.txt gives me the following result:

ACN
007 350 807
ABN
71 007 350 807
Annual Turnover
$5M - $10M
Number of Employees
6-10
QA
ISO9001-2008, AS9120B,
Export Percentage
5 %
Industry Categories
<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>Logistics<br/>Marine<br/>Procurement<br/></td>
Company Email
lisa@aerospacematerials.com.au
Company Website
http://www.aerospacematerials.com.au
Office
2/6 Ovata Drive Tullamarine VIC 3043
Post
PO Box 188 TullamarineVIC 3043
Phone
+61.3. 9464 4455
Fax
+61.3. 9464 4422

So far, so good. I will look at writing this to a CSV or something later, but for now I am wondering how to break out the data contained in <td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>Logistics<br/>Marine<br/>Procurement<br/></td> onto separate lines.

Like this:

Industry Categories
Aerospace
Land (Vehicles, etc)
Logistics
Marine
Procurement

I have tried a bit of regex such as:

if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        value = re.findall('\>(.*?)\<', ic_next_td)
        print value[0]

But I get the error TypeError: expected string or buffer. I am thinking I need to iterate over the findall result or something too.

The method needs to be general enough to handle other variations in the same format, such as ‘Donkey’ or ‘Boat’ instead of ‘Aerospace’ or ‘Logistics’ (I will not necessarily know all the possibilities up front in the scenario I am thinking about).

Is there a way to pull this out using the br tag and BeautifulSoup, or a regex?
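
On the regex route, the TypeError above most likely arises because re.findall expects a string, while find_next_sibling returns a BeautifulSoup Tag object. A minimal sketch, assuming the tag is first converted to text with str(ic_next_td); the cell_html string below is a hand-written stand-in for that result (copied from the output above) so the snippet runs without the saved page. It is written for Python 3, so print is a function:

```python
import re

# Hypothetical stand-in for str(ic_next_td): the cell's markup as plain text.
cell_html = ('<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>'
             'Logistics<br/>Marine<br/>Procurement<br/></td>')

# Capture every run of non-'<' characters that sits between a '>' and a '<'.
categories = re.findall(r'>([^<]+)<', cell_html)

for category in categories:
    print(category)
```

This pattern depends on the exact markup and would break on nested tags, which is one reason a parser-based approach tends to be more robust.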

Sorry this is a bit long. As always, I am also very happy to receive any suggested code optimisations so I can continue to learn the best way to build Python scripts correctly.

Thank you!

Update

This code worked:

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        for value in ic_next_td.strings:
                print value

and this code produced an indentation error:

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        for value in ic_next_td.strings:
            print value

Note the seemingly double indentation of print value in the working code. It seemed to me that the next level of indentation would be a single indent after for value in ic_next_td.strings:?
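
One common cause of this symptom is a file that mixes tabs and spaces, so that lines which look aligned in the editor are not aligned as far as Python is concerned. The post does not show the raw whitespace, so this is only a guess. A minimal sketch under that assumption, written for Python 3 (which rejects an ambiguous mix with TabError, a subclass of IndentationError); the mixed_source string is hypothetical:

```python
# Hypothetical source text: the 'for' line is indented with one tab,
# the 'print' line with eight spaces. Many editors display both the same way.
mixed_source = (
    "if True:\n"
    "\tfor value in ['Aerospace', 'Marine']:\n"
    "        print(value)\n"
)

try:
    # compile() only parses the text; nothing is executed.
    compile(mixed_source, "<example>", "exec")
    outcome = "compiled"
except TabError:
    outcome = "TabError"

print(outcome)
```

If that is what happened, converting the whole file to spaces only (most editors have a command for this) should let the single-indent version run.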

Answer 1

You’ll have to parse out the contents of ic_next_td a little further. Luckily, the original page uses <br/> tags to give you places to delimit the text. Don’t bother with a regex here; BeautifulSoup has better tools for you:

for value in ic_next_td.strings:
    print value

would result in:

Aerospace
Land (Vehicles, etc)
Logistics
Marine
Procurement

You can store all these in a list by calling list() on the .strings iterator:

values = list(ic_next_td.strings)
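
As a variation on the above, here is a minimal sketch assuming Python 3 and the html.parser backend; the html string is a hypothetical fragment modelled on the cell shown in the question. It uses .stripped_strings, which behaves like .strings but trims surrounding whitespace and skips whitespace-only strings, which may be handy if the real page has newlines between the <br/> tags:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the cell shown in the question.
html = (
    '<table><tr>'
    '<td class="paraheading">Industry Categories:</td>'
    '<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>'
    'Logistics<br/>Marine<br/>Procurement<br/></td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("td", class_="paraheading")
ic_next_td = heading.find_next_sibling("td", class_="bodytext")

# Each text node between the <br/> tags becomes one list entry.
values = list(ic_next_td.stripped_strings)

for value in values:
    print(value)
```

Because nothing here depends on the category names themselves, the same loop should cope with values that are not known up front.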

Editorial Team · Answer 1 · 2026-06-14T12:06:03+00:00
