I have strings like:
Avery® Laser &amp; Inkjet Self-Adhesive
I need to convert them to
Avery Laser & Inkjet Self-Adhesive.
I.e. remove special characters and convert html special chars to regular ones.
First use StringEscapeUtils#unescapeHtml4() (or #unescapeXml(), depending on the original format) to unescape the &amp; into a &. Then use String#replaceAll() with [^\x20-\x7e] to get rid of characters which aren't inside the printable ASCII range. Summarized:
String clean = StringEscapeUtils.unescapeHtml4(dirty).replaceAll("[^\\x20-\\x7e]", "");
..which produces
Avery Laser & Inkjet Self-Adhesive
(without the trailing dot as in your example, but that wasn't present in the original 😉)
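The two steps can also be tried without adding a dependency. The sketch below is only an illustration: `unescapeBasic` and its tiny entity table are hypothetical stand-ins for `StringEscapeUtils.unescapeHtml4()`, which covers far more entities, and the class and method names are made up for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CleanString {
    // Hypothetical, deliberately small entity table. In real code, prefer
    // StringEscapeUtils.unescapeHtml4(), which handles the full HTML 4 set.
    private static final Map<String, String> ENTITIES = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "reg", "\u00AE");

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Step 1: unescape named entities in a single pass, so that
    // "&amp;lt;" becomes "&lt;" and is not unescaped a second time.
    static String unescapeBasic(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Unknown entities are left exactly as they were.
            String repl = ENTITIES.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Step 2: drop everything outside the printable ASCII range.
    static String clean(String dirty) {
        return unescapeBasic(dirty).replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Avery\u00AE Laser &amp; Inkjet Self-Adhesive"));
    }
}
```

The order matters: unescaping first means an entity such as `&reg;` is turned into ® and then stripped by the ASCII filter, instead of surviving as literal text.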
That said, this looks more like a request for a workaround than a request for a solution. If you elaborate on the functional requirement and/or where this string originated, we may be able to provide the right solution. The ® looks like it was caused by using the wrong encoding to read the string in, and the &amp; looks like it was caused by using a text-based parser to read the string in instead of a full-fledged HTML parser.
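To illustrate the encoding point, here is a small sketch of how a wrong charset can corrupt the ® sign at read time. It assumes, purely for illustration, that the source bytes are UTF-8 and that they were decoded as ISO-8859-1; the actual encodings involved in the question are not known, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Avery®" as it would be stored in a UTF-8 file: ® is the two bytes C2 AE.
        byte[] utf8Bytes = "Avery\u00AE".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: each of the two bytes becomes its own character.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);
        // Right charset: the two bytes are read back as a single ®.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrong.length()); // 7 characters
        System.out.println(right.length()); // 6 characters
    }
}
```

If the string is read with the correct charset in the first place, the ® arrives intact and can be kept or removed deliberately rather than stripped as collateral damage.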