We are migrating from one content system to another and have a large amount of HTML containing lines such as this one:
<p style="text-align: justify;"><i> </i></p>
I am looking for a way to use Python to strip HTML lines that produce no visible text when rendered. A line like the one above should be removed.
This is just one of MANY examples of lines with no text output, so I need a way to find and strip them all. I don’t have to worry about images, movies, etc., since only text was possible in our old content management system.
BTW, the vast majority of the lines either start with a p tag or a div tag (ignoring leading whitespace).
If the HTML is also a well-formed XML document (it can be made so in a pre-pass with a tool such as HTML Tidy), this transformation:
when applied to any such XML document, for example:
produces the wanted result, in which any element whose string value is empty or consists only of whitespace is deleted:
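Since the question asks for Python, the same rule (delete any element whose string value is empty or all whitespace) can also be sketched with the standard library's `xml.etree.ElementTree`. This is only a rough illustration, not the transformation referred to above: the function names `strip_empty_elements`, `_has_text` and `_prune` are made up for this example, and it assumes the input is already well-formed XML with a single root element. Named HTML entities such as `&nbsp;` are not defined in plain XML, so they would probably need converting in the same pre-pass.

```python
import xml.etree.ElementTree as ET


def _has_text(elem):
    """Return True if elem or any descendant contains non-whitespace text."""
    return any(chunk.strip() for chunk in elem.itertext())


def _prune(parent):
    """Recursively remove child elements that contain no visible text."""
    for child in list(parent):  # iterate over a copy, since we remove in place
        if _has_text(child):
            _prune(child)
            continue
        # Text that follows the removed element (its "tail") belongs to the
        # parent, so move it onto the previous sibling or the parent's text.
        if child.tail:
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + child.tail
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + child.tail
        parent.remove(child)


def strip_empty_elements(xml_text):
    """Parse well-formed XML and drop every element with no text content.

    The root element itself is always kept, even if it ends up empty.
    """
    root = ET.fromstring(xml_text)
    _prune(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<div><p style="text-align: justify;"><i> </i></p>'
        "<p>Hello <b>world</b></p><div>  </div></div>"
    )
    print(strip_empty_elements(sample))
```

Because only text was possible in the old system, treating "no text" as "nothing rendered" should be safe here; with images or other void elements in the markup, the `_has_text` check would need an exception list.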