I have a simple HTML file like this. I pulled it from a wiki page, removed some HTML attributes, and reduced it to this simple page.
<html>
<body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
<h2>second header</h2>
<p>
<!-- ..again some text and images -->
</p>
</body>
</html>
I read this HTML file using Python and Beautiful Soup like this.
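from bs4 import BeautifulSoup

# Name the parser explicitly and close the file once it has been parsed.
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")
pages = []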
What I’d like to do is split this HTML page into two parts. The first part runs from the first <h2> header up to the second <h2> header. The second part runs from the second <h2> header to the </body> tag. Then I’d like to store them in a list, e.g. pages, so that I can create multiple pages from one HTML page according to its <h2> tags.
Any ideas on how I should do this? Thanks.
Look for the h2 tags, then use .next_sibling to grab everything until you reach another h2 tag. Using your sample, this gives two pages, one per h2 tag.
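A minimal sketch of this approach, not necessarily the exact code originally posted. It assumes the sample is inlined as a string with placeholder text in place of the comments, uses the built-in "html.parser", and keeps each h2 tag as the start of its own page. The helper name split_on_h2 is hypothetical.

```python
from bs4 import BeautifulSoup

# The sample from the question, inlined (with placeholder text) so the
# sketch runs on its own. In practice you would parse test.html instead.
html = """<html>
<body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>some text</p>
<h3>some header</h3>
<p>some image</p>
<p>more text</p>
<h2>second header</h2>
<p>again some text</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")


def split_on_h2(soup):
    """Return one HTML string per h2 tag (hypothetical helper)."""
    pages = []
    for h2 in soup.find_all("h2"):
        parts = [str(h2)]  # each page starts with its own h2
        node = h2.next_sibling
        # Walk the siblings until the next h2 or the end of the parent tag.
        # Text nodes between tags are siblings too, so read .name defensively.
        while node is not None and getattr(node, "name", None) != "h2":
            parts.append(str(node))
            node = node.next_sibling
        pages.append("".join(parts))
    return pages


pages = split_on_h2(soup)
for page in pages:
    print(page)
    print("-----")
```

This only works when the h2 tags are siblings, as they are in the sample where they all sit directly inside body. If a wiki page wraps some headers in extra div tags, the walk stops at the end of that wrapper instead of at the next h2.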