I am evaluating jsoup for functionality that would sanitize (but not remove!) non-whitelisted tags. Let’s say only the <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
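To make the requirement concrete, the transformation described above can be sketched in plain Java without jsoup. This is an illustration only: the class TagEscaper and its method escapeDisallowedTags are hypothetical names, not part of any library, and the regex is an assumption that breaks on a ">" inside an attribute value, on comments, and on CDATA, which is why a real parser is preferable.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a jsoup class and not a robust HTML parser.
final class TagEscaper {
    // Matches an opening or closing tag and captures its name,
    // e.g. <b>, </b>, <script onLoad='x'>.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][^\\s/>]*)[^>]*>");

    /** Escapes the angle brackets of every tag whose name is not in allowed. */
    static String escapeDisallowedTags(String html, Set<String> allowed) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(html, last, m.start());
            String tag = m.group();
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (allowed.contains(name)) {
                out.append(tag); // allowed tag: keep as-is
            } else {
                // Disallowed tag: keep name and attributes, escape only the brackets.
                out.append("&lt;").append(tag, 1, tag.length() - 1).append("&gt;");
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        System.out.println(escapeDisallowedTags(input, Set.of("b")));
    }
}
```

The input is never wrapped in <html>, <head> or <body>, and no whitespace is added, so a fragment stays a fragment and a full document stays a full document.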
I see the following problems/questions with jsoup:
- document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements(), but the point is that I don’t know if my source is a full HTML document or just the body, and I want the result in the same shape and form as it came in;
- how do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
- Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don’t suggest my desired functionality is supported.
Answer 1
How do you load / parse your Document with Jsoup? If you use parse() or connect().get(), jsoup will automatically format your HTML (inserting html, body and head tags). This ensures you always have a complete HTML document, even if the input isn’t complete.
Let’s assume you only want to clean an input (no further processing); in that case you should use clean() instead of the previously listed methods.
Example 1 – Using parse()
Output:
The input HTML is completed so that you have a full document.
Example 2 – Using clean()
Output:
The input HTML is cleaned, nothing more.
Documentation:
Answer 2
The method replaceWith() does exactly what you need:
Example:
Output:
Or body only:
Output:
Documentation:
Answer 3
Yes, the prettyPrint() method of Document.OutputSettings does this.
Example:
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
Documentation:
Answer 4 (no bullet)
No! Jsoup is one of the best and most capable HTML libraries out there!