I am using Perl to read UTF-16LE files on Windows 7.
If I read an ASCII file with the following code, each “\r\n” in the file is converted into a “\n” in memory:
open CUR_FILE, "<", $asciiFile;
If I read a UTF-16LE (Windows code page 1200) file with the following code, each “\r\n” is left unchanged:
open CUR_FILE, "<:encoding(UTF-16LE)", $utf16leFile;
This inconsistency causes problems when I try to match lines that contain line breaks with regular expressions.
Update:
For each line of a UTF-16LE file:
$line =~ /(.*)$/
Then the string matched in $1 will include a “\r” at the end…
What version of Perl are you using? UTF-16 and CRLF handling did not mix properly before 5.8.9 (Unicode changes in 5.8.9). I’m not sure about 5.10.0, but it works in 5.10.1 and 5.8.9. You might need to use
"<:encoding(UTF-16LE):crlf" when opening the file.