I am trying to clean HTML text and to extract plain text from it


Asked: May 30, 2026


I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.

For example the HTML text is:

String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";

Now if I use Jsoup#parse(String html):

String text = Jsoup.parse(html).text();

It prints:

Á example link.

And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):

String text = Jsoup.clean(html, Whitelist.none());

It prints:

&Aacute; example link.

My question is, how can I get the text

Á example link.

using Whitelist and the clean() method? I want to use Whitelist because I might need to use Whitelist#addTags(String... tags).

Any information will be very helpful to me.

Thanks.


1 Answer

Editorial Team · Answer 1 · 2026-05-30T18:17:51+00:00


This is not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity escaping feature, and there is currently no “don’t escape” mode (check Entities.EscapeMode).

You can either (1) unescape these HTML entities in the cleaned output, or (2) extend jsoup’s source code by adding a new escape mode with an empty map.
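As a rough illustration of the first option, here is a minimal sketch of unescaping entities in a string such as the one Jsoup.clean(html, Whitelist.none()) returned in the question. EntityUnescaper is a hypothetical helper, not part of jsoup, and its entity table is deliberately tiny; a real project would more likely use a complete entity table from a library. The sketch assumes Java 9 or later and uses only the standard library, so the cleaned string is passed in as a literal.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a jsoup class): decodes a few HTML entities.
class EntityUnescaper {
    // Assumption: a small sample table; the full HTML entity list is much larger.
    private static final Map<String, String> NAMED = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
        "nbsp", "\u00A0", "Aacute", "\u00C1", "aacute", "\u00E1");

    // Matches &name; as well as decimal (&#193;) and hex (&#xC1;) references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String decode(String body, String original) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                int cp = Integer.parseInt(body.substring(2), 16);
                return new String(Character.toChars(cp));
            }
            if (body.startsWith("#")) {
                int cp = Integer.parseInt(body.substring(1));
                return new String(Character.toChars(cp));
            }
        } catch (IllegalArgumentException e) {
            // Number too large or not a valid code point: leave the text as-is.
            return original;
        }
        // Unknown names are left untouched rather than dropped.
        return NAMED.getOrDefault(body, original);
    }

    public static void main(String[] args) {
        // Stand-in for the output of Jsoup.clean(html, Whitelist.none()).
        String cleaned = "&Aacute; example link.";
        System.out.println(unescape(cleaned));
    }
}
```

Note that unescaping also turns &lt; and &amp; back into < and &, so the result should be treated as plain text and not written back into an HTML page without escaping it again.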
