I’m using BeautifulSoup to escape all HTML tags (except for a set of pre-approved tags, such as <a>) in arbitrary text. However, I only want it to escape things that are actual valid HTML tags. If something looks like a tag but isn’t, BeautifulSoup adds a closing tag for it, which I don’t want.
Example: If someone enters the text <integer>, my code ends up outputting <integer></integer> instead of just <integer>.
Here’s the code (value is the HTML string and VALID_TAGS is just a list of acceptable tag names).
    soup = BeautifulSoup.BeautifulSoup(
        value, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # Loop through all the tags. If a tag is not in VALID_TAGS, escape its characters.
    for tag in soup.findAll():
        if tag.name not in VALID_TAGS:
            tag.replaceWith(cgi.escape(str(tag)))
    return soup.renderContents()
Thanks in advance.
Figured this out using html5lib based on this answer as a starting point. Here’s a version of what I ended up with that does the same thing as the BeautifulSoup code I started with above, except that it works properly for the <integer> case I described:
Thanks to everyone who helped.
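A minimal sketch of the token-based idea, for illustration only; this is not the html5lib code from the answer above. It swaps html5lib for Python 3's standard-library html.parser, which reports tags as they appear in the input and never adds closing tags of its own. The names TagEscaper and escape_unapproved_tags, and the contents of VALID_TAGS, are assumptions made up for this example. Note that it passes attributes on approved tags through untouched, so it is not a full sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; the original post only says VALID_TAGS is a list of tag names.
VALID_TAGS = {"a", "b", "i", "em", "strong"}


class TagEscaper(HTMLParser):
    """Pass approved tags through unchanged; escape every other tag as text."""

    def __init__(self, valid_tags):
        # convert_charrefs=False keeps entities like &amp; out of handle_data,
        # so they can be written back exactly as they were entered.
        super().__init__(convert_charrefs=False)
        self.valid_tags = set(valid_tags)
        self.out = []

    def _emit_tag(self, name, raw):
        # Approved tags are copied as-is; anything else becomes &lt;...&gt;.
        self.out.append(raw if name in self.valid_tags else escape(raw, quote=False))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it once, not as start + end.
        self._emit_tag(tag, self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser only gives the lowercased name, so the end tag is rebuilt.
        self._emit_tag(tag, "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append(escape("<!--%s-->" % data, quote=False))


def escape_unapproved_tags(value, valid_tags=VALID_TAGS):
    """Return value with every tag outside valid_tags HTML-escaped."""
    parser = TagEscaper(valid_tags)
    parser.feed(value)
    parser.close()
    return "".join(parser.out)
```

With this sketch, escape_unapproved_tags("<integer>") gives &lt;integer&gt; with no closing tag added, while <a href="x">link</a> passes through unchanged.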