I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven’t found anything except some half-baked implementations in PHP, etc.
For example, something like
<p> <ul> <li>Foo
becomes
<p> <ul> <li>Foo</li> </ul> </p>
Any help would be appreciated 🙂
using BeautifulSoup:
gets you
As far as I know, you can’t control putting the <li></li> tags on separate lines from Foo.
using Tidy:
gets you
Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing
comes out as
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
Finally, Tidy can also do indenting:
becomes
All of these have their ups and downs, but hopefully one of them is close enough.