I have many HTML documents that contain many HTML entities written as Unicode code points (numeric character references), e.g. the ones that render as بروح
Is there a good tool to convert HTML entities in multiple HTML documents to plain UTF-8/UTF-16/UTF-32 characters?
I want an offline converter tool that can do a batch job for this purpose.
The GNU utility “recode” will do this, with the invocation
(or UTF-16BE, of course.)
http://ftp.gnu.org/gnu/recode/recode-3.6.tar.gz
Its use of HTML as a character set is a bit of a hack: HTML is treated as either ASCII or LATIN-1, when it should be treated as a “surface” for any character set. If the input contains any UTF-8 characters, it can break, so I’m now withdrawing my recommendation. Use the first.
(You might expect
recode UTF-8..HTML,HTML..UTF-16LE
to work, but this first encodes the ampersands…)
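If recode's HTML handling gets in the way because the files already contain UTF-8 characters, the conversion can also be done with a short script. The following is only a minimal Python sketch, not something recode provides. The names decode_numeric_refs and convert_file are made up for illustration, and it assumes the input files are already valid UTF-8 (or plain ASCII). It decodes only numeric character references such as &#1576; or &#x628;, and it deliberately leaves references to &, <, >, " and ' alone so the markup keeps its meaning. Named entities such as &eacute; are not handled.

```python
import re
import sys
from pathlib import Path

# Numeric character references: decimal (&#1576;) or hexadecimal (&#x628;).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Characters that must stay escaped so the HTML markup is not altered.
KEEP_ESCAPED = {"&", "<", ">", '"', "'"}


def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave surrogates and out-of-range values as they were written.
        if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        char = chr(code)
        return match.group(0) if char in KEEP_ESCAPED else char

    return NUMERIC_REF.sub(replace, text)


def convert_file(path, source_encoding="utf-8", target_encoding="utf-8"):
    """Rewrite one file in place with its numeric references decoded."""
    text = path.read_text(encoding=source_encoding)
    path.write_text(decode_numeric_refs(text), encoding=target_encoding)


if __name__ == "__main__":
    # Usage (hypothetical script name): python decode_refs.py a.html b.html
    for name in sys.argv[1:]:
        convert_file(Path(name))
```

Because the script works on already-decoded text, existing UTF-8 characters pass through untouched, which is the case that trips up the chained recode request. Passing target_encoding="utf-16" or "utf-32" to convert_file would write those encodings instead. The files are overwritten in place, so try it on copies first.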