I am trying to parse html using BeautifulSoup to try and extract the webpage


Asked: June 6, 2026


I am trying to parse HTML with BeautifulSoup to extract the webpage title. Sometimes this fails because the page is badly written (a bad end tag, for example). When that happens, I fall back to a manual regex.

I have the text

<html xmlns="http://www.w3.org/1999/xhtml"\n      xmlns:og="http://ogp.me/ns#"\n      xmlns:fb="https://www.facebook.com/2008/fbml">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n    <title>\n                    .@wolfblitzercnn prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n            </title>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />...

I am trying to grab the text between the <title> and </title> tags. It should be fairly simple, but it is not working. Here’s my Python code for it.

result = re.search('\<title\>(.+?)\</title\>', html)
if result is not None:
    title = result.group(0)

This does not work on this text for whatever reason. Either the search returns None, or I get an error: AttributeError: ‘NoneType’ object has no attribute ‘groups’

I’ve copied and pasted this text into online Python regex testers and tried all the options (re.match, re.findall, re.search), and they work there, but for whatever reason my script cannot find anything between these tags. The same happens with other patterns, such as

<title>(.*?)</title>

etc


1 Answer

Editorial Team · Answer 1 · 2026-06-06T10:04:54+00:00


You should use the re.DOTALL flag so that . matches newline characters as well.

result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)

As the documentation says:

…without this flag, '.' will match anything except a newline
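To see the difference concretely, here is a minimal sketch using only the standard library. The sample markup is a shortened stand-in for the text in the question, and the helper name extract_title is hypothetical, chosen just for this illustration. It also uses group(1) rather than group(0), since group(0) returns the whole match including the tags, while group(1) returns only the captured text between them.

```python
import re

# Shortened stand-in for the markup in the question (an assumption for this
# sketch): the title text sits on its own line, surrounded by newlines.
html = (
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head>\n'
    '    <title>\n'
    '        .@wolfblitzercnn prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n'
    '    </title>\n'
    '</head>\n'
)

# With re.DOTALL, "." also matches "\n", so the group can span several lines.
TITLE_RE = re.compile(r'<title>(.+?)</title>', re.DOTALL)


def extract_title(markup):
    """Return the stripped text between <title> and </title>, or None."""
    match = TITLE_RE.search(markup)
    if match is None:
        return None
    # group(1) is the captured text only; group(0) would include the tags.
    return match.group(1).strip()


# Without the flag, "." stops at the newline right after <title>,
# so the same pattern finds nothing in this markup.
no_flag = re.search(r'<title>(.+?)</title>', html)

title = extract_title(html)
print(no_flag)
print(title)
```

The online testers probably succeeded because the pasted text contained no real newlines, or because a dotall option was switched on there; that is a guess, not something the question confirms.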
