I want to parse a part of html page, say
my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
<a href="#">Link1</a>
<a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""
I pass this string to BeautifulSoup:
soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template
But during parsing BeautifulSoup adds <html>,<head> and <body> tags (if using lxml or html5lib parsers), and I don’t need those in my code. The only way I’ve found up to now to avoid this is to use html.parser.
I wonder if there is a way to get rid of redundant tags using lxml – the quickest parser.
UPDATE
Originally my question was asked incorrectly. Now I removed <div> wrapper from my example, since common user does not use this tag. For this reason we cannot use .extract() method to get rid of <html>, <head> and <body> tags.
I could solve the problem using .contents property:
I think that
''.join(soup.body.contents)would be more neat list to string converting, but this does not work and I get