I have some Perl code that translates new-lines and line-feeds to a normalized form. The input text is Japanese, so there will be multi-byte characters.
Is it still possible to do this transformation on a byte-by-byte basis (which I think it currently does), or do I have to detect the character set and enable Unicode support? In other words, do the popular encodings (Shift-JIS, EUC-JP, UTF-8, ISO-2022-JP) use bytes inside their multi-byte characters that could be mistaken for ASCII control characters?
I need only CR and LF to work.
Update: Added ISO-2022-JP. And that is the one that looks the most troublesome with its funky escape sequences …
None of the four encodings that you mention (Shift-JIS, UTF-8, EUC-JP, ISO-2022-JP) use the CR (0x0D) or LF (0x0A) byte inside Japanese characters. For UTF-8 and EUC-JP, there is no overlap whatsoever between low ASCII characters and bytes inside Japanese characters: every byte of a multi-byte character has the high bit set. For Shift-JIS and ISO-2022-JP there is overlap, but only with printable ASCII (trail bytes 0x40-0x7E in Shift-JIS, bytes 0x21-0x7E in ISO-2022-JP), not in the control range where you find CR and LF.
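And the escape sequences that ISO-2022-JP uses to switch back and forth between its character sets are ESC ( B (ASCII), ESC ( J (JIS X 0201-1976 Roman), ESC $ @ (JIS X 0208-1978), and ESC $ B (JIS X 0208-1983). Each one is the ESC byte (0x1B) followed by two printable ASCII bytes.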
As you can see, none of the characters used to encode Japanese characters in ISO-2022-JP overlap with CR or LF.
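Again, there is no overlap with CR and LF. In Shift-JIS, lead bytes fall in 0x81-0x9F and 0xE0-0xFC, and trail bytes fall in 0x40-0x7E and 0x80-0xFC, so neither 0x0D nor 0x0A can appear inside a two-byte character.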
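To make this concrete, here is a minimal sketch of a byte-level normalization in Perl, checked by round-tripping a sample string through each of the four encodings. The function name `normalize_newlines` and the sample text are made up for illustration, and normalizing to LF is an assumption about the target form; only the core `Encode` module is used. The sample deliberately includes ソ, whose Shift-JIS trail byte is 0x5C (the ASCII backslash), to show that the overlap with printable ASCII does not affect CR or LF handling.

```perl
use strict;
use warnings;
use utf8;                       # the sample strings below are UTF-8 source literals
use Encode qw(encode decode);

# Hypothetical helper: collapse CR LF and lone CR to LF.
# It works on raw octets and never decodes them.
sub normalize_newlines {
    my ($bytes) = @_;
    $bytes =~ s/\x0D\x0A?/\x0A/g;
    return $bytes;
}

my $text     = "日本語\r\nソフト\rテキスト\n";   # mixed CR LF, CR, and LF endings
my $expected = "日本語\nソフト\nテキスト\n";

for my $name (qw(UTF-8 euc-jp shiftjis iso-2022-jp)) {
    # Encode first, normalize the octets, then decode only to verify the result.
    my $octets = normalize_newlines(encode($name, $text));
    my $ok     = decode($name, $octets) eq $expected;
    printf "%-12s %s\n", $name, $ok ? "ok" : "MISMATCH";
}
```

In real code the octets would come from a filehandle opened in `:raw` mode rather than from `encode`; the substitution itself stays the same.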