I am using BeautifulSoup in a simple function to extract links that have all

Question

0

Asked: May 17, 20262026-05-17T23:47:34+00:00 2026-05-17T23:47:34+00:00

I am using BeautifulSoup in a simple function to extract links that have all

0

I am using BeautifulSoup in a simple function to extract links that have all uppercase text:

def findAllCapsUrls(page_contents):
    """ given HTML, returns a list of URLs that have ALL CAPS text
    """
    soup = BeautifulSoup.BeautifulSoup(page_contents)
    all_urls = node_with_links.findAll(name='a')

    # if the text for the link is ALL CAPS then add the link to good_urls
    good_urls = []
    for url in all_urls:
        text = url.find(text=True)
        if text.upper() == text:
            good_urls.append(url['href'])

    return good_urls

Works well most of the time, but a handful of pages will not parse correctly in BeautifulSoup (or lxml, which I also tried) due to malformed HTML on the page, resulting in an object with no (or only some) links in it. A “handful” might sound like not-a-big-deal, but this function is being used in a crawler so there could be hundreds of pages that the crawler will never find…

How can the above function be refactored to not use a parser like BeautifulSoup? I’ve searched around for how to do this using regex, but all the answers say “use BeautifulSoup.” Alternatively, I started looking at how to “fix” the malformed HTML so that is parses, but I don’t think that is the best route…

What is an alternative solution, using re or something else, that can do the same as the function above?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-17T23:47:34+00:00

I ended up with a combination of regex and BeautifulSoup:

def findAllCapsUrls2(page_contents):
    """ returns a list of URLs that have ALL CAPS text, given
    the HTML from a page. Uses a combo of RE and BeautifulSoup
    to handle malformed pages.
    """
    # get all anchors on page using regex
    p = r'<a\s+href\s*=\s*"([^"]*)"[^>]*>(.*?(?=</a>))</a>'
    re_urls = re.compile(p, re.DOTALL)
    all_a = re_urls.findall(page_contents)

    # if the text for the anchor is ALL CAPS then add the link to good_urls
    good_urls = []
    for a in all_a:
        href = a[0]
        a_content = a[1]
        a_soup = BeautifulSoup.BeautifulSoup(a_content)
        text = ''.join([s.strip() for s in a_soup.findAll(text=True) if s])
        if text and text.upper() == text:
            good_urls.append(href)

    return good_urls

This is working for my use cases so far, but I wouldn’t guarantee it to work on all pages. Also, I only use this function if the original one fails.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am using BeautifulSoup in a simple function to extract links that have all

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply