When I use iconv to convert from UTF-16 to UTF-8, everything works, but the reverse conversion does not.
I have these files:
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The text looks OK in an editor. When I run this:
iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings
Then I get this result:
b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The file utility does not report the expected file format, and the text does not look right in an editor either. Could it be that iconv does not write a proper BOM? I am running it on the Mac command line.
Why is b-16 not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?
More detail is below.
$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings: Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings: UTF-8 Unicode c program text, with very long lines
b-16be.strings: Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings: data
$ od -c a-16.strings | head
0000000 377 376 / \0 * \0 \0 \f 001 E \0 S \0 K \0
$ od -c a-8.strings | head
0000000 / * * * Č ** E S K Y ( J V O
$ od -c b-16be.strings | head
0000000 376 377 \0 / \0 * \0 * \0 * \0 001 \f \0 E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
It is clear that the BOM is missing whenever I convert to UTF-16LE.
Any help on this?
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn’t necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine’s byte order) with a BOM.
If you’re on a little-endian machine, I don’t see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
I find that the file command doesn’t recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
Try running od -c on the files to see their actual contents.
UPDATE:
It looks like you’re on a big-endian machine (x86 is little-endian), and you’re trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won’t do that directly. But this should work:
The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
(Can anyone suggest a more elegant solution?)
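A minimal sketch of the printf-based workaround described above: write the two BOM bytes (FF FE) first, then append the BOM-less output of iconv -t UTF-16LE. The function name utf8_to_utf16le_bom is hypothetical, and octal escapes are used here (an assumption on my part, not necessarily what was originally posted) because they are emitted as raw bytes by POSIX printf.

```shell
# Hypothetical helper: convert a UTF-8 file to UTF-16LE with a leading BOM.
utf8_to_utf16le_bom() {
    # FF FE is the little-endian UTF-16 byte order mark, written as octal escapes.
    printf '\377\376'
    # iconv emits UTF-16LE without a BOM, so the output lands right after the two bytes above.
    iconv -f UTF-8 -t UTF-16LE "$1"
}
```

Usage would be something like utf8_to_utf16le_bom a-8.strings > b-16.strings, after which od -c b-16.strings | head should start with 377 376.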
Another workaround, if you know the endianness of the output produced by -t utf-16:
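One way this could look, assuming -t UTF-16 on your system emits big-endian text with a BOM (as the b-16be.strings dump in the question suggests): swap every byte pair, so that the BOM and the text both come out little-endian. The function name utf8_to_utf16_swapped is hypothetical; conv=swab is a standard dd conversion. The intermediate file is there so that dd reads from a regular file rather than a pipe.

```shell
# Hypothetical helper: convert UTF-8 to UTF-16 (with BOM), then swap each byte pair.
# ASSUMPTION: iconv -t UTF-16 produces big-endian output starting with FE FF,
# so the swapped result is little-endian text starting with FF FE.
# On a system where -t UTF-16 is already little-endian, this would flip it to big-endian.
utf8_to_utf16_swapped() {
    tmp=$(mktemp) || return 1
    iconv -f UTF-8 -t UTF-16 "$1" > "$tmp" &&
        dd if="$tmp" conv=swab 2>/dev/null
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage would be something like utf8_to_utf16_swapped a-8.strings > b-16.strings; checking the first two bytes with od -c confirms which byte order you ended up with.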