How would it be possible to do the following:
- Scan through an HTML page (preferably a whole domain, e.g. www.python.org) and extract all
h1, h2, …, hn tags
and write all headings to a file, in the correct order:
Start with h1
Then h2
until we reach the next h1
Given the requirement to scan a whole website, you might want to look into pycurl to fetch the pages to scrape. Be careful not to hit the site with the equivalent of a DoS attack though: pause between requests.
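
For the heading extraction itself, here is a minimal sketch. It uses the standard library's html.parser and urllib rather than pycurl, purely so that it runs without extra installs. All the names (HeadingParser, extract_headings, format_outline, fetch_page, scrape) are made up for illustration, and the one-second delay is an assumed value, not a recommendation from the site. It takes a list of URLs you already have. Discovering every page of a domain and honoring robots.txt are not covered.

```python
import time
import urllib.request
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingParser(HTMLParser):
    """Collect (level, text) pairs for h1..h6 in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished headings, in page order
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore headings nested in a heading.
        if tag in HEADING_TAGS and self._level is None:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def extract_headings(html):
    """Return the headings of one HTML document as (level, text) pairs."""
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return parser.headings


def format_outline(headings):
    """Indent each heading by its level so h2 sits under the preceding h1."""
    return "\n".join(
        "  " * (level - 1) + f"h{level}: {text}" for level, text in headings
    )


def fetch_page(url):
    """Download one page and decode it (assumes UTF-8 if no charset is sent)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def scrape(urls, out_path, delay=1.0, fetcher=fetch_page):
    """Write the heading outline of each URL to out_path, pausing between requests."""
    with open(out_path, "w", encoding="utf-8") as out:
        for i, url in enumerate(urls):
            if i:
                time.sleep(delay)  # throttle so the site is not hammered
            out.write(url + "\n")
            outline = format_outline(extract_headings(fetcher(url)))
            if outline:
                out.write(outline + "\n")
```

Calling scrape(["https://www.python.org/"], "headings.txt") would then write that page's headings to headings.txt in the order they appear, each h2 indented under the h1 before it. The fetcher parameter is only there so a pycurl-based download function can be swapped in.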