I am parsing an HTML document using the lxml library (http://lxml.de/). So far I have figured out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post leaves all the text: it strips the tags without removing the actual script. I have also found the class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is clear as mud how to actually use the class to clean the document. Any help, perhaps a short example, would be appreciated!
Below is an example to do what you want. For an HTML document, `Cleaner` is a better general solution to the problem than using `strip_elements`, because in cases like this you want to strip out more than just the `<script>` tag; you also want to get rid of things like `onclick=function()` attributes on other tags.

You can get a list of the options you can set in the lxml.html.clean.Cleaner documentation; some options you can just set to `True` or `False` (the default) and others take a list.

Note the difference between kill and remove:
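As the lxml documentation describes it, `kill_tags` removes the listed tags together with everything inside them, while `remove_tags` drops only the tags themselves and pulls their content up into the parent. A minimal sketch of both ideas follows. It assumes a small inline HTML string (`SAMPLE`, made up for illustration) instead of a real page, and it assumes that on recent lxml releases the cleaner lives in the separate `lxml_html_clean` package, so the import tries that first and falls back to `lxml.html.clean`.

```python
# Sketch only: SAMPLE is a hypothetical document used for illustration.
try:
    # Assumption: newer lxml releases ship the cleaner as a separate package.
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml releases bundle it here, as in the question.
    from lxml.html.clean import Cleaner

SAMPLE = (
    "<html><head><script>alert('hi')</script>"
    "<style>p {color: red}</style></head>"
    "<body><h1>Title</h1>"
    '<p onclick="doSomething()">Hello <b>world</b></p>'
    "</body></html>"
)

# Boolean options: switch on the JavaScript and stylesheet filters.
cleaner = Cleaner()
cleaner.javascript = True  # strip script content and on* event attributes
cleaner.style = True       # strip <style> elements and style attributes
cleaned = cleaner.clean_html(SAMPLE)  # a str in gives a str back
print(cleaned)

# List options: kill_tags drops the element and its content,
# remove_tags drops only the tag and keeps the content.
picky = Cleaner()
picky.kill_tags = ["h1"]
picky.remove_tags = ["b"]
result = picky.clean_html(SAMPLE)
print(result)
```

With this input, the first call should leave the visible text in place while the script body, the stylesheet, and the `onclick` attribute disappear. In the second call the heading text is gone entirely (killed), whereas the word inside `<b>` survives without its tag (removed).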