Update

As @ikegami suggested, I reported this as a bug.

Bug #121783 for perl5: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output

Consider the following C and Perl programs, both of which write the UTF-8 encoding of the string "αβγ" to standard output:

C version:

    #include <stdio.h>

    int main(void)
    {
        /* UTF-8 encoded alpha, beta, gamma */
        char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
        puts(x);
        return 0;
    }

    C:\…> chcp 65001
    Active code page: 65001

    C:\…> cttt.exe
    αβγ

Perl version:

    C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
    αβγ
    �

From what I can tell, the last octet, 0xb3, is being output a second time on a line of its own, where it is rendered as U+FFFD.
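To illustrate why a stray 0xb3 would show up as U+FFFD: 0xb3 is a UTF-8 continuation byte, so on its own it is not a valid sequence, and decoders conventionally substitute the replacement character. The following is only a minimal sketch of such a decoder. `decode_utf8_lossy` is a hypothetical helper written for this post, not anything from perl or the C runtime, and it does not check for overlong forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only: decode up to n bytes of UTF-8
 * from s into out (capacity cap) and return the number of code points
 * stored. A byte that cannot start a sequence, or a sequence that is cut
 * short, is replaced by U+FFFD. Overlong forms and surrogates are NOT
 * rejected, so this is not a full validator. */
size_t decode_utf8_lossy(const unsigned char *s, size_t n,
                         unsigned long *out, size_t cap)
{
    size_t i = 0, count = 0;

    while (i < n && count < cap) {
        unsigned char b = s[i];
        unsigned long cp;
        size_t extra, k;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else {
            /* Lone continuation byte (10xxxxxx) or invalid lead byte. */
            out[count++] = 0xFFFD;
            i++;
            continue;
        }

        /* Fold in the continuation bytes that are actually present. */
        for (k = 1; k <= extra && i + k < n && (s[i + k] & 0xC0) == 0x80; k++)
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);

        if (k <= extra) {
            /* Sequence was cut short: emit U+FFFD, skip what was consumed. */
            out[count++] = 0xFFFD;
            i += k;
        } else {
            out[count++] = cp;
            i += extra + 1;
        }
    }
    return count;
}
```

Fed the nine bytes ce b1 ce b2 ce b3 0a b3 0a (the expected output followed by a repeated 0xb3 and a newline, which is what the display suggests), it yields six code points: U+03B1, U+03B2, U+03B3, U+000A, U+FFFD, U+000A.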

Note that redirecting output eliminates this effect.
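One way to check this is to redirect the output to a file and inspect the bytes directly. Below is a rough sketch of a hex dumper for that purpose. `format_hex` and `dump_file` are hypothetical names made up for this post, and the file name passed to `dump_file` is whatever the output was redirected to.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format n bytes as lowercase hex pairs separated by
 * spaces, e.g. "ce b1". Returns the number of characters written, not
 * counting the terminating NUL, or 0 if n is 0 or the buffer is too small. */
size_t format_hex(const unsigned char *buf, size_t n, char *out, size_t cap)
{
    size_t i, pos = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';
    if (n == 0 || cap < n * 3)  /* two digits per byte plus separator or NUL */
        return 0;

    for (i = 0; i < n; i++)
        pos += (size_t)sprintf(out + pos, i ? " %02x" : "%02x",
                               (unsigned)buf[i]);
    return pos;
}

/* Hypothetical helper: print the contents of a file as hex, 16 bytes per
 * line. Returns 0 on success, -1 if the file cannot be opened. */
int dump_file(const char *path)
{
    unsigned char chunk[16];
    char line[sizeof chunk * 3];
    size_t got;
    FILE *fp = fopen(path, "rb");  /* "rb" avoids CRLF translation on Windows */

    if (fp == NULL)
        return -1;
    while ((got = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        format_hex(chunk, got, line, sizeof line);
        puts(line);
    }
    fclose(fp);
    return 0;
}
```

If redirection really does avoid the duplication, the dump of the captured file should end right after the line terminator, with no trailing b3.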

I can also verify that it is the last octet being repeated:

    C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
    αβγxyz
    z

On the other hand, syswrite avoids this problem.

    C:\…> perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
    αβγxyz

I have observed this in cmd.exe windows on Windows 8.1 Pro 64-bit and Windows Vista Home 32-bit using both self-built perl 5.18.2 and ActiveState's 5.16.3.

I do not see the problem in Cygwin, Linux, or Mac OS X environments. Also, Cygwin's perl 5.14.4 produces correct output in cmd.exe.

Also, when the code page is set to 437, the output from both the C and the Perl versions is identical:

    C:\…> chcp 437
    Active code page: 437

    C:\…> cttt.exe
    ╬▒╬▓╬│

    C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
    ╬▒╬▓╬│

What is causing the last octet to be output twice when printing from a Perl program in cmd.exe with the code page set to 65001?

PS: I have some more information and screenshots on my blog. For this question, I have tried to distill everything to the simplest possible cases.

PPS: Leaving out the \n results in something even more interesting:

    C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
    αβγxyzxyz

    C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
    αβγ�γ�