I want to parse a part of html page, say my_string = <p>Some text.

Question

Asked: June 7, 2026 (2026-06-07T17:12:02+00:00)

I want to parse a fragment of an HTML page, say:

my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
   <a href="#">Link1</a>
   <a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""

I pass this string to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template
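
The two commented steps could look roughly like the sketch below. The function name `add_nofollow` and its `parser` argument are made up for illustration. It defaults to html.parser, which leaves the fragment unwrapped; with lxml or html5lib the returned string would include the extra <html> and <body> tags.

```python
from bs4 import BeautifulSoup


def add_nofollow(markup, parser="html.parser"):
    """Return markup with rel="nofollow" set on every <a> tag.

    Hypothetical helper: the name and the default parser are assumptions.
    """
    soup = BeautifulSoup(markup, parser)
    for link in soup.find_all("a"):
        # Item assignment on a Tag sets (or overwrites) an attribute.
        link["rel"] = "nofollow"
    # With html.parser this is the fragment itself, without a wrapper.
    return str(soup)


if __name__ == "__main__":
    print(add_nofollow('<p>Some text. <a href="#">Link1</a></p>'))
```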

But during parsing, BeautifulSoup adds <html>, <head> and <body> tags (when using the lxml or html5lib parsers), and I don't need those in the result. The only way I've found so far to avoid this is to use html.parser.

I wonder if there is a way to get rid of the redundant tags while still using lxml, the fastest parser.

UPDATE

My question was originally phrased incorrectly. I have now removed the <div> wrapper from my example, since an ordinary user would not use this tag. For this reason we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.

1 Answer

Editorial Team · Answer 1 · 2026-06-07T17:12:04+00:00

I was able to solve the problem using the .contents property:

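try:
    # lxml and html5lib wrap the fragment in <html><body>; keep only the body's children
    children = soup.body.contents
    string = ''
    for child in children:
        # str() serializes each child node back to markup
        string += str(child)
    return string
except AttributeError:
    # html.parser adds no <body>, so soup.body is None
    return str(soup)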

I thought that ''.join(soup.body.contents) would be a neater way to convert the list to a string, but it does not work, because the items are Tag objects rather than strings, and I get:

TypeError: sequence item 0: expected string, Tag found
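
One possible way around the TypeError is to convert each child to a string before joining. The sketch below assumes a hypothetical helper name, `inner_html`, and falls back to the whole soup when the parser added no <body>. BeautifulSoup's own `decode_contents()` method also serializes a tag's children and may be a simpler alternative.

```python
from bs4 import BeautifulSoup


def inner_html(soup):
    """Serialize the children of <body>, or of the soup if there is no <body>.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    root = soup.body if soup.body is not None else soup
    # .contents holds Tag and NavigableString objects, so convert each
    # one with str() before joining.
    return "".join(str(child) for child in root.contents)


if __name__ == "__main__":
    soup = BeautifulSoup("<p>Some text.</p><p>One more paragraph</p>", "html.parser")
    print(inner_html(soup))
    # Built-in alternative that serializes the children directly:
    print(soup.decode_contents())
```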
