I need to stream-process, using Perl, a 1 GB text file encoded in UTF-16 little-endian with Unix-style line endings (i.e. 0x000A only, without 0x000D in the stream) and an LE BOM at the beginning. The file is processed on Windows (Unix solutions are needed too). By stream-process I mean using while (<>): line-by-line reading and writing.
Would be nice to have a command line one-liner like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt
Hex dump of input for testing (two lines: “a” and “b” letters on each):
FF FE 61 00 0A 00 62 00 0A 00
Processing like s/b/c/g should give this output (“b” replaced with “c”):
FF FE 61 00 0A 00 63 00 0A 00
PS. Right now, in all my attempts, either there’s a problem with CRLF output (0D 0A bytes are written, producing an incorrect Unicode character, whereas I need only 0A00 without 0D00 to preserve the same Unix style), or every new line switches LE/BE, i.e. the same “a” is 6100 on the odd lines and 0061 on the even lines in the output.
The best I’ve come up with is this:
But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can’t get it to work correctly in this case.
The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input and is opened before Perl begins running. When you binmode STDIN in a BEGIN block, the file is already open, so you can change its encoding.
When you use infile.txt, the filename is passed as a command line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can’t set its encoding. Theoretically, you ought to be able to declare the layers with the open pragma and have the magic <ARGV> processing apply the right encoding. But I haven’t been able to get that to work right in this case.