I’m currently scraping a website and have all the useful data I need, but it comes mixed with some data that I don’t want.
Example:
<h2>Heading</h2>
<p>Useful <a href="/foo">data</a></p>
Rubbish <a href="/bar">data</a>
<h2>heading</h2>
So essentially I want to remove all text that is not enclosed by either h2 or p tags.
Is there an easy function or regex (preg) for this?
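For what it's worth, a parser-based filter may be more robust here than a regex, since the unwanted text is not wrapped in any tag of its own. Below is a minimal sketch of that idea, not a definitive solution. It assumes Python and its standard-library `html.parser` (the post does not say which language is in use), and the names `KeepTagsParser`, `strip_outside` and `KEEP` are made up for illustration. It keeps everything inside `h2` or `p` elements, including nested tags such as links, and drops everything else.

```python
from html.parser import HTMLParser

# Tags whose contents should survive (assumption: only h2 and p matter).
KEEP = {"h2", "p"}


class KeepTagsParser(HTMLParser):
    """Collects markup that sits inside a kept tag; ignores the rest."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed
        # through unchanged instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # how many kept tags are currently open
        self.parts = []  # output fragments

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.depth += 1
        if self.depth:
            # Re-emit the start tag exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept only inside a kept tag.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.parts.append(f"</{tag}>")
        if tag in KEEP and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.parts.append("\n")  # one kept block per line

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def strip_outside(html):
    """Return only the parts of `html` enclosed by tags listed in KEEP."""
    parser = KeepTagsParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = (
    '<h2>Heading</h2>\n'
    '<p>Useful <a href="/foo">data</a></p>\n'
    'Rubbish <a href="/bar">data</a>\n'
    '<h2>heading</h2>\n'
)
print(strip_outside(sample))
```

Run on the example from the question, this should leave the two headings and the paragraph (with its link) and drop the "Rubbish" line. Badly nested or unclosed tags on the real site may need extra handling.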
Results are a little better: