I have a simple html file like this. In fact I pulled it from

Question

Asked: June 17, 2026 (2026-06-17T14:43:44+00:00)

I have a simple HTML file like the one below. I pulled it from a wiki page, removed some HTML attributes, and reduced it to this simple page.

<html>
   <body>
      <h1>draw electronics schematics</h1>
      <h2>first header</h2>
      <p>
         <!-- ..some text images -->
      </p>
      <h3>some header</h3>
      <p>
         <!-- ..some image -->
      </p>
      <p>
         <!-- ..some text -->
      </p>
      <h2>second header</h2>
      <p>
         <!-- ..again some text and images -->
      </p>
   </body>
</html>

I read this HTML file with Python and Beautiful Soup like this:

from bs4 import BeautifulSoup

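# Name the parser explicitly and close the file once it has been parsed
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")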

pages = []

I'd like to split this HTML page into two parts. The first part runs from the first <h2> header up to the second <h2> header. The second part runs from the second <h2> header to the closing </body> tag. I'd then like to store the parts in a list, e.g. pages, so that I can create multiple pages from one HTML page according to its <h2> tags.

Any ideas on how I should do this? Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T14:43:46+00:00

Find the h2 tags, then follow .next_sibling from each one, collecting elements until you reach the next h2 tag:

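from bs4 import BeautifulSoup
from bs4.element import Tag

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")
pages = []
h2tags = soup.find_all('h2')

def next_element(elem):
    # Return the next sibling that is a tag, skipping NavigableString
    # objects (whitespace, bare text); returns None when siblings run out
    while elem is not None:
        elem = elem.next_sibling
        if isinstance(elem, Tag):
            return elem

for h2tag in h2tags:
    page = [str(h2tag)]
    elem = next_element(h2tag)
    # Collect sibling tags until the next h2 or the end of the parent
    while elem is not None and elem.name != 'h2':
        page.append(str(elem))
        elem = next_element(elem)
    pages.append('\n'.join(page))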

Using your sample, this gives:

>>> pages
['<h2>first header</h2>\n<p>\n<!-- ..some text images -->\n</p>\n<h3>some header</h3>\n<p>\n<!-- ..some image -->\n</p>\n<p>\n<!-- ..some text -->\n</p>', '<h2>second header</h2>\n<p>\n<!-- ..again some text and images -->\n</p>']
>>> print(pages[0])
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
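
If you would rather not depend on a file on disk, the same idea can be sketched as a small function that takes the HTML as a string. This is only an illustration: the name split_by_heading, its heading parameter, and the shortened sample markup are my own assumptions, not part of your code. It iterates over .next_siblings instead of using a helper function.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def split_by_heading(html, heading="h2"):
    """Return one HTML fragment per heading tag (hypothetical helper).

    Each fragment starts at a heading and runs up to, but not including,
    the next heading of the same name among its siblings.
    """
    soup = BeautifulSoup(html, "html.parser")
    pages = []
    for start in soup.find_all(heading):
        parts = [str(start)]
        for sibling in start.next_siblings:
            # Skip bare strings (mostly whitespace between tags)
            if not isinstance(sibling, Tag):
                continue
            # Stop at the next heading of the same level
            if sibling.name == heading:
                break
            parts.append(str(sibling))
        pages.append("\n".join(parts))
    return pages


# Shortened stand-in for the sample page (assumed content)
sample = """<html><body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>text</p>
<h3>some header</h3>
<p>more text</p>
<h2>second header</h2>
<p>final text</p>
</body></html>"""

pages = split_by_heading(sample)
print(len(pages))
```

Like the loop above, this only looks at siblings of each h2, so it assumes the headers all sit directly under the same parent, and any bare text placed directly between the tags is dropped.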
