I am using BeautifulSoup in a simple function to extract links that have all uppercase text:
    import BeautifulSoup

    def findAllCapsUrls(page_contents):
        """ given HTML, returns a list of URLs that have ALL CAPS text
        """
        soup = BeautifulSoup.BeautifulSoup(page_contents)
        all_urls = soup.findAll(name='a')
        # if the text for the link is ALL CAPS then add the link to good_urls
        good_urls = []
        for url in all_urls:
            text = url.find(text=True)
            # skip anchors that have no text node or no href attribute
            if text and text.upper() == text and url.get('href'):
                good_urls.append(url['href'])
        return good_urls
This works well most of the time, but a handful of pages do not parse correctly in BeautifulSoup (or lxml, which I also tried) because of malformed HTML, so the resulting object contains no links, or only some of them. A “handful” might not sound like a big deal, but this function is used in a crawler, so there could be hundreds of pages that the crawler will never find.
How can the above function be refactored so that it does not use a parser like BeautifulSoup? I’ve searched for how to do this with regex, but all the answers say “use BeautifulSoup.” Alternatively, I started looking at how to “fix” the malformed HTML so that it parses, but I don’t think that is the best route.
What is an alternative solution, using re or something else, that can do the same as the function above?
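For illustration, a regex-only version might look roughly like the sketch below. It is a sketch under assumptions, not a robust HTML parser: the name find_all_caps_urls_regex and both patterns are made up here, it assumes every link of interest has a closing </a>, and it checks all of the text inside the anchor rather than only the first text node.

```python
import re
from html import unescape

# Hypothetical pattern: an <a> tag with an href (double-quoted, single-quoted
# or unquoted), followed by everything up to the next </a>.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>(.*?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)
# Used to drop nested tags such as <b> or <span> from the link text.
TAG_RE = re.compile(r'<[^>]+>')


def find_all_caps_urls_regex(page_contents):
    """Return hrefs of links whose visible text is ALL CAPS, using only regex."""
    good_urls = []
    for match in ANCHOR_RE.finditer(page_contents):
        # Exactly one of the three href groups matched; pick that one.
        href = next(g for g in match.group(1, 2, 3) if g is not None)
        # Strip nested tags, decode entities like &amp;, trim whitespace.
        text = unescape(TAG_RE.sub('', match.group(4))).strip()
        if href and text and text == text.upper():
            good_urls.append(unescape(href))
    return good_urls
```

Because the pattern only looks at each anchor and its closing tag, unclosed or mismatched tags elsewhere on the page do not affect it. It can still be fooled, for example by a ">" inside an attribute value or by anchors inside comments and scripts.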
I ended up with a combination of regex and BeautifulSoup:
This is working for my use cases so far, but I wouldn’t guarantee it to work on all pages. Also, I only use this function if the original one fails.
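A minimal sketch of the general idea of combining a regex with a parser is given here; it is not the original code. The regex cuts out each <a>...</a> fragment, and a parser then only has to handle that small fragment, so broken markup elsewhere on the page cannot hide the link. To keep the example self-contained, the standard library's html.parser stands in for BeautifulSoup, and the names find_all_caps_urls_fallback and _AnchorFragmentParser are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: one anchor element, from <a ...> to the next </a>.
ANCHOR_FRAGMENT_RE = re.compile(r'<a\b[^>]*>.*?</a\s*>', re.IGNORECASE | re.DOTALL)


class _AnchorFragmentParser(HTMLParser):
    """Collects the first href and all text from a single anchor fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.href = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == 'a' and self.href is None:
            self.href = dict(attrs).get('href')

    def handle_data(self, data):
        self.text_parts.append(data)


def find_all_caps_urls_fallback(page_contents):
    """Return hrefs of ALL CAPS links by parsing each anchor fragment on its own."""
    good_urls = []
    for fragment in ANCHOR_FRAGMENT_RE.findall(page_contents):
        parser = _AnchorFragmentParser()
        parser.feed(fragment)
        parser.close()  # flush any buffered text
        text = ''.join(parser.text_parts).strip()
        if parser.href and text and text == text.upper():
            good_urls.append(parser.href)
    return good_urls
```

In keeping with the note above, a function like this would only be called when the parser-based one raises or comes back with no links. An anchor that is never closed gets merged with the next one by the regex, so some pages can still slip through.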