I am working on a project that requires me to scan through a large number of HTML files (8000+). Some of these files are broken, but this is an inevitable consequence of where the files come from and cannot be fixed.
I have chosen to use BeautifulSoup4 to find and extract the data. The code for this is the following:
from bs4 import BeautifulSoup
data = open(r'data\file.html', encoding='utf-8')  # raw string, so that \f is not read as a form feed
soup = BeautifulSoup(data)
tag = soup.find('strong', text="Heading:")
split_tag = str(tag.next_sibling.next_element.next_element).split(", ")
This opens a file and searches for a strong tag containing the text “Heading:”. It then takes the text that follows this tag and splits it at the commas.
However, if the source file is broken, it may not contain a strong tag with the text “Heading:”. In that case soup.find() returns None, so tag is None, and the split_tag line raises an AttributeError because None has no next_sibling.
I tried to fix this by using the following method:
try:
    split_tag = str(tag.next_sibling.next_element.next_element).split(", ")
except AttributeError:
    pass
else:
    split_tag = str(tag.next_sibling.next_element.next_element).split(", ")
This did not work. I also tried expressing this as a function but no luck.
So I turn to you. What I want to do is to split the contents at the commas if there are any contents. If not, the script should just pass.
I am very grateful for any assistance!
Here is how I would rewrite your exception handler. If there is no such heading, then we should expect to get a zero-length list of tags.
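A minimal sketch of that idea, assuming tag is whatever soup.find() returned (so it may be None). The function name split_after_heading is only illustrative and does not appear in the question:

```python
def split_after_heading(tag):
    """Return the comma-separated items that follow `tag`, or [] if missing.

    `tag` is assumed to be the result of soup.find(...), so it may be None.
    """
    try:
        text = str(tag.next_sibling.next_element.next_element)
    except AttributeError:
        # tag, or an intermediate node in the chain, was None:
        # there is nothing to split, so fall back to an empty list.
        return []
    return text.split(", ")


# A broken file gives tag = None, which now yields an empty list.
print(split_after_heading(None))
```

The point of the fallback is that split_tag always ends up bound to a list, so code further down can loop over it without checking whether the heading was found.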
But in this case a simple if statement should work well, because soup.find() returns None when nothing is found. When checking for None, it is best to use the is test for object identity, as in if tag is None.
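As a rough sketch of the if-statement version: the function name extract_items and the two sample HTML strings are made up for illustration, and the real files may be structured differently. It also passes string= rather than text= to find(), which is the newer spelling of the same argument in BeautifulSoup 4, and names html.parser explicitly.

```python
from bs4 import BeautifulSoup


def extract_items(html):
    """Return the comma-separated items after the 'Heading:' tag, or []."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("strong", string="Heading:")
    if tag is None:
        # No such heading in this file: nothing to split.
        return []
    return str(tag.next_sibling.next_element.next_element).split(", ")


# Hypothetical inputs: one well-formed snippet, one without the heading.
good = '<p><strong>Heading:</strong> <a href="#">a, b, c</a></p>'
broken = "<p>No heading here</p>"

print(extract_items(good))
print(extract_items(broken))
```

Note that this only guards against the heading itself being absent. If a file has the heading but nothing after it, the chained attribute access can still raise an AttributeError, so the try/except form is the more forgiving of the two.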