I’m using JSoup to sanitize some untrusted HTML. I discovered that if I call
String html = "<div id='foo'><script type='text/javascript'>alert('hello');</script></div>";
String cleanedHtml = Jsoup.clean(html, Whitelist.relaxed());
At this point cleanedHtml is
<div></div>
So the <script> tag has correctly been removed, but mysteriously, so has the id attribute of the <div>. Is there any good reason why this should be removed or is it a bug?
By default the id attribute is removed; add it to the whitelist as an allowable attribute.

Is it a bug? Not AFAIC; it's in the source. IMO there are documentation bugs, though.
Is there "any good reason" why this should be removed? Not sure about that one, but attributes like this aren't structural: removing one doesn't change the shape of the DOM tree. That's the thing about whitelists: they explicitly allow, and must be curated to match your precise needs.
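
For reference, here is a minimal sketch of allowing the id attribute, assuming a jsoup version that still ships org.jsoup.safety.Whitelist (newer releases rename it to Safelist, with the same addAttributes method as far as I know). The class name KeepIdExample and the helper cleanKeepingDivId are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class KeepIdExample {

    // Hypothetical helper: start from the relaxed whitelist and
    // additionally allow the id attribute on <div> elements.
    static String cleanKeepingDivId(String html) {
        Whitelist whitelist = Whitelist.relaxed().addAttributes("div", "id");
        return Jsoup.clean(html, whitelist);
    }

    public static void main(String[] args) {
        String html = "<div id='foo'><script type='text/javascript'>alert('hello');</script></div>";
        // The <script> element is still stripped; the id on the <div> is kept.
        System.out.println(cleanKeepingDivId(html));
    }
}
```

If you want id kept on every allowed tag rather than just div, the jsoup docs describe a pseudo tag for that: addAttributes(":all", "id"). Whether that is wise depends on what your page does with ids, so it may be safer to allow it only on the tags you actually need.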