In the output below, why am I getting extra newlines after printing non-ASCII Unicode characters?
The platform is Windows Vista, and the problem occurs after chcp 65001 but not after chcp 850.
C:\>chcp 850
Active code page: 850
C:\>perl unicode_bug_1.pl
Budweiser
Budweiser
Budweiser
Bud─øjovick├¢ Budvar
Bud─øjovick├¢ Budvar
Bud─øjovick├¢ Budvar
C:\>chcp 65001
Active code page: 65001
C:\>perl unicode_bug_1.pl
Budweiser
Budweiser
Budweiser
Budějovický Budvar
Budějovický Budvar
Budějovický Budvar
The output comes from this program:
#!perl
use strict;
use warnings;
binmode(STDOUT, ":encoding(UTF-8)"); # so no "Wide character in print" warning
print "Budweiser\n" for 1..3;
print "Bud\N{U+011B}jovick\N{U+00FD} Budvar\n" for 1..3;
This seems to be a bug in Perl. I had thought it was a bug in Windows, with code page 65001 not really being supported for the console, but I finally wrote test programs in C and Perl, and the problem does not happen in the C version. It happens no matter where the Unicode character occurs in the line, but the line you’re printing must be wider than the console supports.
Here is my C program:
And here is my Perl program:
UPDATE
No, I was wrong. With the help of some of the guys at #perl on irc.perl.org, it turns out to be a bug in the Microsoft API.
WriteFile is documented to return the number of bytes written, but here it returns the number of characters written, which depends on the code page. A bug was filed in March 2010. There is more discussion in the MSDN forums.
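To make the consequence concrete, here is a minimal, portable simulation in C. It does not call the real Windows API. `buggy_write`, `write_all`, `utf8_chars`, and the `console` buffer are hypothetical names invented for this sketch, and it assumes the caller retries whatever part of the buffer was not reported as written, the way buffered I/O layers commonly do. Under that assumption, a call that writes every byte but reports a character count makes the caller send the tail of the line a second time.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the console: collects everything "written". */
static char console[256];
static size_t console_len = 0;

/* Count UTF-8 characters: each byte that is not a continuation
   byte (10xxxxxx) starts a new character. */
static size_t utf8_chars(const char *buf, size_t nbytes)
{
    size_t n = 0;
    for (size_t i = 0; i < nbytes; i++)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Assumed model of the misbehaving call: it writes all nbytes,
   but reports the number of characters instead of bytes. */
static size_t buggy_write(const char *buf, size_t nbytes)
{
    assert(console_len + nbytes < sizeof console);
    memcpy(console + console_len, buf, nbytes);
    console_len += nbytes;
    console[console_len] = '\0';
    return utf8_chars(buf, nbytes);
}

/* A conventional "keep writing until everything is reported written" loop.
   This retry behaviour is an assumption about the caller. */
static void write_all(const char *buf, size_t nbytes)
{
    while (nbytes > 0) {
        size_t reported = buggy_write(buf, nbytes);
        if (reported == 0)   /* remainder is only continuation bytes; give up */
            break;
        buf += reported;
        nbytes -= reported;
    }
}

#ifdef DEMO_MAIN
int main(void)
{
    /* "Budějovický Budvar\n" as UTF-8: 21 bytes, 19 characters. */
    const char *line = "Bud" "\xC4\x9B" "jovick" "\xC3\xBD" " Budvar\n";
    write_all(line, strlen(line));
    fputs(console, stdout);
    return 0;
}
#endif
```

Compiled with `-DDEMO_MAIN`, the simulated console ends up holding the full line followed by a second copy of its last two bytes, `r` and a newline. A pure-ASCII line such as `Budweiser\n` has equal byte and character counts, so it is written only once.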
UPDATE 2
I posted about this problem on Michael Kaplan’s blog, “Sorting it all out”, and he responded with an article entitled “Hidden in plain site: a purloined letter kind of a bug report”. He’s a Microsoft internationalization expert, so you will surely find some insights there…