I am doing some data analysis that requires extracting the page title, breadcrumb, and h1 tags from hundreds of HTML and SHTML files.
The tags are in the following format (that is, the content inside <title>, <h1>, and the breadcrumb):
<title>Mapping a Drive: Macintosh OSX < Mapping a Drive < eHelp < Cal Poly Pomona</title>
<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> » <a href="index.shtml">Mapping a Drive</a> » Mac OS X<!-- InstanceEndEditable --></p>
<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" --><a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X <!-- InstanceEndEditable --></h1>
After getting those tags, I want to further extract the first part of the title ("Mapping a Drive: Macintosh OSX"), the last part of the breadcrumb ("Mac OS X"), and the whole h1 ("Mapping a Drive: Macintosh OS X").
Any idea how that can be accomplished?
Use a real HTML parser, not a regex. You will be happier.
lxml.html is highly regarded, as is BeautifulSoup.
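To make the parser-based approach concrete, here is a minimal sketch. So that it runs without installing anything, it uses Python's built-in html.parser module as a stand-in for lxml.html or BeautifulSoup; the same logic carries over to either library. The names PageFieldParser and extract_fields are hypothetical, and the sketch assumes every file follows the sample's layout: the breadcrumb sits between the InstanceBeginEditable name="breadcrumb" and InstanceEndEditable comments, title parts are separated by "<", and breadcrumb parts by "»".

```python
from html.parser import HTMLParser
import re


class PageFieldParser(HTMLParser):
    """Collect the text of <title>, <h1>, and the "breadcrumb" editable region."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = {"title": [], "h1": [], "breadcrumb": []}
        self.open_tags = set()      # which of title/h1 we are currently inside
        self.in_breadcrumb = False  # True between the breadcrumb template comments

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self.open_tags.add(tag)

    def handle_endtag(self, tag):
        self.open_tags.discard(tag)

    def handle_comment(self, data):
        # Assumption: the template comments look exactly like those in the sample.
        text = data.strip()
        if text.startswith("InstanceBeginEditable"):
            self.in_breadcrumb = 'name="breadcrumb"' in text
        elif text.startswith("InstanceEndEditable"):
            self.in_breadcrumb = False

    def handle_data(self, data):
        for tag in self.open_tags:
            self.chunks[tag].append(data)
        if self.in_breadcrumb:
            self.chunks["breadcrumb"].append(data)


def extract_fields(html_text):
    """Return the first title part, the last breadcrumb part, and the full h1 text."""
    parser = PageFieldParser()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace so text split across inline tags reads cleanly.
    text = {
        key: re.sub(r"\s+", " ", "".join(parts)).strip()
        for key, parts in parser.chunks.items()
    }
    return {
        "title": text["title"].split("<")[0].strip(),
        "breadcrumb": text["breadcrumb"].split("»")[-1].strip(),
        "h1": text["h1"],
    }


# The three sample lines from the question.
sample = (
    '<title>Mapping a Drive: Macintosh OSX < Mapping a Drive < eHelp < Cal Poly Pomona</title>\n'
    '<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> » '
    '<a href="index.shtml">Mapping a Drive</a> » Mac OS X<!-- InstanceEndEditable --></p>\n'
    '<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" -->'
    '<a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X '
    '<!-- InstanceEndEditable --></h1>\n'
)

fields = extract_fields(sample)
print(fields)
```

For the full set of files, you could loop over them (for example with pathlib.Path.rglob), read each file's text, and call extract_fields on it. Pages that deviate from the sample layout will likely need extra handling.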