The Archive Base

Editorial Team
Asked: June 14, 2026

With a great deal of help from the Stack Overflow community I am learning a lot about Python, and specifically about scraping with BeautifulSoup. I am referring again to the same example page I am using to learn my way through this.

I have the following code:

from bs4 import BeautifulSoup
import re

f = open('webpage.txt', 'r')
g = f.read()
soup = BeautifulSoup(g)

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        print ic_next_td

which from this page:

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

saved as webpage.txt gives me the following result:

ACN
007 350 807
ABN
71 007 350 807
Annual Turnover
$5M - $10M
Number of Employees
6-10
QA
ISO9001-2008, AS9120B,
Export Percentage
5 %
Industry Categories
<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>Logistics<br/>Marine<br/>Procurement<br/></td>
Company Email
lisa@aerospacematerials.com.au
Company Website
http://www.aerospacematerials.com.au
Office
2/6 Ovata Drive Tullamarine VIC 3043
Post
PO Box 188 TullamarineVIC 3043
Phone
+61.3. 9464 4455
Fax
+61.3. 9464 4422

So far, so good. I will look at writing this to a CSV file or similar later, but for now I am wondering how to break out the data contained in <td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>Logistics<br/>Marine<br/>Procurement<br/></td> onto separate lines.

Like this:

Industry Categories
Aerospace
Land (Vehicles, etc)
Logistics
Marine
Procurement

I have tried a bit of regex such as:

if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        value = re.findall('\>(.*?)\<', ic_next_td)
        print value[0]

But I get the error TypeError: expected string or buffer. I am thinking I may also need to iterate over the findall result.
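
The TypeError most likely arises because re.findall expects a string, while find_next_sibling returns a BeautifulSoup Tag object. As a minimal sketch, assuming Python 3 (where the message reads "expected string or bytes-like object") and using a hypothetical FakeTag class as a stand-in for the Tag so the example runs without the saved page, converting the object with str() first lets the pattern run. The empty match produced after the trailing <br/> is filtered out:

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed tag: not a string, but printable as markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


cell = FakeTag(
    '<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>'
    'Logistics<br/>Marine<br/>Procurement<br/></td>'
)

# Passing the object itself fails, because re.findall only accepts str or bytes.
try:
    re.findall(r'>(.*?)<', cell)
    raised = False
except TypeError:
    raised = True

# Converting to a string first works; drop the empty match between "<br/>" and "</td>".
values = [v for v in re.findall(r'>(.*?)<', str(cell)) if v]

for value in values:
    print(value)
```

This only illustrates where the error comes from. A pattern like this is fragile against nested tags or attributes containing ">" characters, so a parser-based approach is usually the safer choice.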

The method needs to be general enough to handle other variations in the same format, such as ‘Donkey’ or ‘Boat’ instead of ‘Aerospace’ or ‘Logistics’ (I will not necessarily know all the possibilities up front in the scenario I am thinking about).

Is there a way to pull this out using the br tag and BeautifulSoup, or a regex?

Sorry this is a bit long. As always, I am also very happy to receive any suggested code optimisations so I can continue to learn the best way to build Python scripts correctly.

Thank you!

Update

This code worked:

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        for value in ic_next_td.strings:
                print value

and this code produced an indentation error:

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        for value in ic_next_td.strings:
            print value

Note the seemingly double indentation of print value in the working code. It seemed to me that the next level of indentation would be a single indent after for value in ic_next_td.strings:?
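
One plausible explanation, which cannot be confirmed without the original file, is that the script mixes tabs and spaces. Python 2 treats a tab as advancing to the next multiple of eight columns, so lines that look aligned in an editor may not be aligned for the interpreter. The sketch below assumes Python 3, which rejects such mixing outright, and uses a made-up three-line snippet to show the effect and one way to normalize it with str.expandtabs:

```python
# Hypothetical snippet: the second line is indented with 8 spaces,
# the third with a single tab. They may look identical in an editor.
source_mixed = "if True:\n        x = 1\n\ty = 2\n"


def try_compile(source):
    """Return None if the source compiles, else the name of the indentation error."""
    try:
        compile(source, "<example>", "exec")
        return None
    except IndentationError as err:  # TabError is a subclass of IndentationError
        return type(err).__name__


mixed_result = try_compile(source_mixed)

# Replacing each tab with spaces up to the next multiple of 8 removes the ambiguity.
clean_result = try_compile(source_mixed.expandtabs(8))

print("mixed:", mixed_result)
print("normalized:", clean_result)
```

Configuring the editor to insert spaces only (four per level is the common convention) avoids the problem in the first place.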

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 12:06 pm

    You’ll have to parse out the contents of ic_next_td a little further. Luckily, the original page uses <br/> tags to give you places to delimit the text. Don’t bother with a regex here; BeautifulSoup has better tools for you:

    for value in ic_next_td.strings:
        print value
    

    would result in:

    Aerospace
    Land (Vehicles, etc)
    Logistics
    Marine
    Procurement
    

    You can store all these in a list by calling list() on the .strings iterator:

    values = list(ic_next_td.strings)
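
A self-contained way to try this out is sketched below. It assumes Python 3 with the bs4 package installed, names the built-in "html.parser" explicitly, and uses a hypothetical HTML fragment modelled on the cell from the question rather than the real saved page. Because a real page usually has line breaks and indentation around the text, the sketch uses .stripped_strings, which trims each string and skips whitespace-only ones:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the <td> shown in the question,
# with extra whitespace added to mimic a saved page.
HTML = """
<table><tr>
  <td class="paraheading">Industry Categories:</td>
  <td class="bodytext">
    Aerospace<br/>
    Land (Vehicles, etc)<br/>
    Logistics<br/>
    Marine<br/>
    Procurement<br/>
  </td>
</tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Locate the heading cell, then the value cell that follows it in the same row.
heading = soup.find("td", class_="paraheading")
ic_next_td = heading.find_next_sibling("td", class_="bodytext")

# Each text run between <br/> tags becomes one entry, with whitespace trimmed.
categories = list(ic_next_td.stripped_strings)

for value in categories:
    print(value)
```

Since nothing here depends on the category names themselves, the same loop should handle cells containing other values in the same layout.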
    
© 2021 The Archive Base. All Rights Reserved