With a great deal of help from the Stack Overflow community I am learning a lot about Python, and specifically about scraping with BeautifulSoup. I am referring again to the same example page I am using to learn my way through this.
I have the following code:
from bs4 import BeautifulSoup
import re

f = open('webpage.txt', 'r')
g = f.read()
soup = BeautifulSoup(g)

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        print ic_next_td
which from this page:
http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113
saved as webpage.txt gives me the following result:
ACN
007 350 807
ABN
71 007 350 807
Annual Turnover
$5M - $10M
Number of Employees
6-10
QA
ISO9001-2008, AS9120B,
Export Percentage
5 %
Industry Categories
<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>Logistics<br/>Marine<br/>Procurement<br/></td>
Company Email
lisa@aerospacematerials.com.au
Company Website
http://www.aerospacematerials.com.au
Office
2/6 Ovata Drive Tullamarine VIC 3043
Post
PO Box 188 TullamarineVIC 3043
Phone
+61.3. 9464 4455
Fax
+61.3. 9464 4422
So far, so good. I will look at writing this to a CSV or something later, but for now I am wondering how to break out the data contained in <td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>Logistics<br/>Marine<br/>Procurement<br/></td> onto separate lines.
Like this:
Industry Categories
Aerospace
Land (Vehicles, etc)
Logistics
Marine
Procurement
I have tried a bit of regex such as:
if key == "Industry Categories":
    print key
    ic_next_td = heading.find_next_sibling("td", class_="bodytext")
    value = re.findall('\>(.*?)\<', ic_next_td)
    print value[0]
But I get the error TypeError: expected string or buffer. I am thinking I need to iterate over the findall result or something too.
The method needs to be general enough to handle other variations in the same format, such as ‘Donkey’ or ‘Boat’ instead of ‘Aerospace’ or ‘Logistics’ (I will not necessarily know all the possibilities up front in the scenario I am thinking about).
Is there a way to pull this out using the br tag and Beautiful soup, or a regex?
Sorry this is a bit long. As always, also very happy for any suggested code optimisations as well so I can continue to learn the best way to build Python scripts correctly.
Thank you!
Update
This code worked:
for heading in soup.find_all("td", class_="paraheading"):
key = " ".join(heading.text.split()).rstrip(":")
if key in columns:
print key
next_td = heading.find_next_sibling("td", class_="bodytext")
value = " ".join(next_td.text.split())
print value
if key == "Industry Categories":
print key
ic_next_td = heading.find_next_sibling("td", class_="bodytext")
for value in ic_next_td.strings:
print value
and this code produced an indentation error:
for heading in soup.find_all("td", class_="paraheading"):
key = " ".join(heading.text.split()).rstrip(":")
if key in columns:
print key
next_td = heading.find_next_sibling("td", class_="bodytext")
value = " ".join(next_td.text.split())
print value
if key == "Industry Categories":
print key
ic_next_td = heading.find_next_sibling("td", class_="bodytext")
for value in ic_next_td.strings:
print value
Note the seemingly double indentation of print value in the working code. It seemed to me that the next level of indentation would be a single indent after for value in ic_next_td.strings:?
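You’ll have to parse out the contents of ic_next_td a little further. Luckily, the original page uses <br/> tags to give you places to delimit the text. Don’t bother with a regex here; BeautifulSoup has better tools for you. Iterating over the .strings attribute of ic_next_td yields each piece of text between the <br/> tags, so printing them in a loop puts each category on its own line.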
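As a minimal sketch of that loop, the snippet below parses only the <td> fragment quoted in the question rather than the full saved page. The html variable and the explicit "html.parser" argument are assumptions made for illustration; in the real script ic_next_td would come from heading.find_next_sibling(...) as before. print is written with parentheses so the snippet should run on both Python 2 and Python 3.

```python
from bs4 import BeautifulSoup

# Stand-in input: the <td> fragment quoted in the question.
html = ('<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>'
        'Logistics<br/>Marine<br/>Procurement<br/></td>')

soup = BeautifulSoup(html, "html.parser")
ic_next_td = soup.find("td", class_="bodytext")

# .strings yields each text node inside the tag in document order;
# the <br/> tags contain no text, so they act as separators.
categories = []
for value in ic_next_td.strings:
    categories.append(value)
    print(value)
```

Nothing here depends on the category names themselves, so other values in the same format ('Donkey', 'Boat', and so on) should come out the same way.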
You can store all these in a list by calling list() on the .strings iterator.
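A short sketch of the list form, again using the quoted <td> fragment as stand-in input (the html variable and the "html.parser" argument are assumptions, not part of the original script). It also shows .stripped_strings, a related BeautifulSoup generator that trims surrounding whitespace and skips whitespace-only strings, which may be handy if the real page has line breaks inside the cell.

```python
from bs4 import BeautifulSoup

# Stand-in input: the <td> fragment quoted in the question.
html = ('<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>'
        'Logistics<br/>Marine<br/>Procurement<br/></td>')

soup = BeautifulSoup(html, "html.parser")
ic_next_td = soup.find("td", class_="bodytext")

# Collect every text node into a list in one step.
industry_categories = list(ic_next_td.strings)

# Same idea, but with whitespace trimmed and blank strings dropped.
cleaned_categories = list(ic_next_td.stripped_strings)

print(industry_categories)
```

As for the TypeError in the regex attempt: re.findall expects a string, while ic_next_td is a BeautifulSoup Tag object. Passing str(ic_next_td) would get past that error, but the .strings approach avoids the regex entirely.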