I want to use sed to take any arbitrary stream and append a null byte to each byte.
I’ve tried a number of things, but have trouble with:
- matching any byte: `.` seems to match only a subset, i.e. any character, not any byte.
- adding a null byte: I thought it should be `\0`, but that doesn’t work.
Answer for original question
I suggest using Perl or Python; here’s a (verbose) Perl solution:
For ASCII text input, this gives you UTF-16LE output (without a BOM). Given that it is Perl, TMTOWTDI, and it can be reduced to a one-liner; see the answer by paxdiablo.
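Since Python was suggested as the other option, here is a minimal sketch of the same byte-by-byte idea in Python. It is not a translation of the Perl script; the function name `add_nul_after_each_byte` and the chunk size are illustrative choices. It works on raw bytes, so no character decoding is involved.

```python
import io


def add_nul_after_each_byte(src, dst, chunk_size=65536):
    """Copy binary stream src to dst, writing a NUL byte after every input byte."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # bytearray(n) is zero-filled, so only the even positions need filling.
        out = bytearray(len(chunk) * 2)
        out[0::2] = chunk
        dst.write(out)


# Small in-memory demonstration; 0xFF shows that arbitrary bytes pass through.
demo_in = io.BytesIO(b"Hi\n\xff")
demo_out = io.BytesIO()
add_nul_after_each_byte(demo_in, demo_out)
```

To use it as a filter, pass `sys.stdin.buffer` and `sys.stdout.buffer` as the two streams, so that both are handled in binary mode.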
Given this explicit loop structure, the easiest way to print the BOM is to add a print statement before the loop:
Given a one-liner, you need a BEGIN block:
There are at least 4, arguably 5, superfluous spaces in that script.
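In Python the same "emit the BOM once, before any data" idea can be sketched as follows. The helper name `ascii_to_utf16le_with_bom` is hypothetical, and the result is only valid UTF-16LE under the assumption that the input is pure ASCII.

```python
import codecs


def ascii_to_utf16le_with_bom(data: bytes) -> bytes:
    """Prefix the UTF-16LE BOM (FF FE), then follow each input byte with a NUL."""
    body = bytearray(len(data) * 2)  # zero-filled
    body[0::2] = data
    # The BOM is written exactly once, ahead of the converted data.
    return codecs.BOM_UTF16_LE + bytes(body)


result = ascii_to_utf16le_with_bom(b"A\n")
```

Decoding `result` with Python's `utf-16` codec consumes the BOM and yields the original text.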
Answer for revised then reverted question
The modified question was:
The revised question is a very different proposition from the original. Converting UTF-8 to UTF-16 is, in general, moderately complex; you have to read 1-4 bytes of input, and generate 2 or 4 bytes of output, worrying about surrogates and malformed input, etc. The original question – how to add a NUL (or zero) byte after each character in the input – is much, much, much simpler. (It remains true that if the input is ASCII – 7-bit byte values between 0 and 127 – then the ‘add a NUL afterwards’ gives you UTF-16LE. But only if the UTF-8 data is in the ASCII subset.)
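A short Python check illustrates why the shortcut only holds for ASCII. The helper `interleave_nul` is an illustrative name; the sample characters take 1, 2, 3 and 4 bytes in UTF-8 respectively.

```python
def interleave_nul(data: bytes) -> bytes:
    """Follow each input byte with a NUL byte."""
    out = bytearray(len(data) * 2)  # zero-filled
    out[0::2] = data
    return bytes(out)


# 'A', e-acute, the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# For each sample: does "add a NUL after each UTF-8 byte" equal real UTF-16LE?
matches = {
    text: interleave_nul(text.encode("utf-8")) == text.encode("utf-16-le")
    for text in samples
}
```

Only the ASCII character matches; the emoji additionally needs a surrogate pair, so its UTF-16LE form is 4 bytes long.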
However, for accurate translation, the tool of choice should be iconv:

Hence, to convert from UTF-8 to UTF-16LE:
Interestingly, I don’t see an option to add a BOM to the output, at least not with iconv version 1.11 from 2007 on RHEL 5 (nor the same version on MacOS X, dated 2006; don’t ask, I don’t know!).
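If a BOM is needed and the available tool does not emit one, one possible workaround is to do the conversion in Python and prepend the BOM explicitly. This is a sketch rather than a replacement for iconv: the function name `utf8_to_utf16le` and its `with_bom` parameter are assumptions made for illustration, and it reads the whole input into memory.

```python
import codecs


def utf8_to_utf16le(data: bytes, with_bom: bool = True) -> bytes:
    """Convert UTF-8 bytes to UTF-16LE, optionally prefixed with a BOM."""
    # Strict decoding raises UnicodeDecodeError on malformed UTF-8.
    text = data.decode("utf-8")
    # The explicit 'utf-16-le' codec never writes a BOM itself.
    encoded = text.encode("utf-16-le")
    return codecs.BOM_UTF16_LE + encoded if with_bom else encoded


converted = utf8_to_utf16le("caf\u00e9".encode("utf-8"))
```

Python's plain `utf-16` codec also writes a BOM, but it uses the platform's native byte order, so naming the byte order and adding the BOM by hand keeps the output predictable.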