If you are looking at the UTF8 flag in application code then your code is broken. Period.

The flag does NOT mean that the string contains character data. It ONLY means that the string either contains a character > 255 now, or could have contained one at some point in the past.
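
A minimal sketch of that "at some point in the past" clause (utf8::is_utf8 reports the flag):

    my $s = "abc";
    print utf8::is_utf8( $s ) ? 1 : 0;   # 0: compact format
    $s .= "\x{100}";                     # a character > 255 forces the variable-width format
    chop $s;                             # back to "abc", but...
    print utf8::is_utf8( $s ) ? 1 : 0;   # 1: the flag stays set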

For strings that contain characters > 127 but < 256, you have no idea whatsoever whether the string holds characters or bytes, regardless of what the UTF8 flag says. (For strings with only characters < 128 it makes no difference whether they are ASCII or binary, and strings with any characters > 255 must be either all characters, or corrupt: character and binary data mixed together.)

The following will produce a perfect copy of lolcat.jpg in lolol.jpg:

    my $funneh = do { local ( @ARGV, $/ ) = 'lolcat.jpg'; <> };
    utf8::upgrade( $funneh ); # now the utf8 flag on $funneh is set!
    open my $copy, '>', 'lolol.jpg' or die $!;
    print $copy $funneh;

It makes no difference at all that the data was internally UTF-8-encoded at one point! Those UTF-8-encoded string elements still represented bytes. And the UTF8 flag would not tell you that this is binary data, because it could not: that is not what it means.

So don’t ask it that.

The flag tells perl how the data is stored; it does not tell Perl code what it means. Strings in Perl are essentially sequences of arbitrarily large integers that know nothing of characters or bytes, and perl has two different internal formats for storing such integer sequences: a compact random-access one that cannot store integers > 255, and a variable-width one that can store anything and just so happens to use the same encoding as UTF-8 because that is convenient. The UTF8 flag just says which of these two formats a string uses. (It has been pointed out many times over the years that the flag should have been called UOK instead, in keeping with the other internal flags.)
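
To see that it is about storage rather than meaning, compare two strings that differ only in internal format (a minimal sketch; eq compares the integer sequences, not their storage):

    my $compact  = "caf\x{e9}";                       # fits in the compact format
    my $variable = $compact;
    utf8::upgrade( $variable );                       # same integers, variable-width format
    print "same value\n" if $compact eq $variable;    # prints "same value"
    print utf8::is_utf8( $compact )  ? 1 : 0;         # 0
    print utf8::is_utf8( $variable ) ? 1 : 0;         # 1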

If you check the perldeltas you will in fact find that the major innovation begun in Perl 5.12 and largely completed in 5.14 is (finally) to stop the regex engine from making this exact mistake: it no longer derives the semantics of a string from its internal representation.
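
The mistake in question is the so-called Unicode Bug, which you can still reproduce under the default semantics (a sketch; exact behaviour depends on which pragmas are in effect):

    my $byte = "\xE9";                       # e-acute in Latin-1... or just a byte
    my $char = $byte;
    utf8::upgrade( $char );                  # identical value, different storage
    print $byte =~ /\w/ ? "word" : "no";     # "no": byte-oriented semantics
    print $char =~ /\w/ ? "word" : "no";     # "word": Unicode semantics
    # use feature 'unicode_strings';         # 5.12+: both behave the same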

Whether any particular packed integer sequence represents bytes or characters, in Perl, is for the programmer to know. The strings themselves carry no knowledge of what they are. If you are designing an API in which bytes vs characters is a concern, then for each string it accepts you must decide whether you want bytes or characters there, document that this is what you expect, and then write the code so it always treats that string the same way: either always as characters or always as bytes. No trying to magically do the right thing: you can’t.
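
For instance, a function whose documented contract is "characters in, UTF-8 bytes on disk" might look like this (a hypothetical sketch; the name write_utf8_file is made up for illustration):

    use Encode qw( encode );

    # Contract: $text is a CHARACTER string, always. Encoding happens here,
    # at the boundary, so the rest of the code never has to guess.
    sub write_utf8_file {
        my ( $path, $text ) = @_;
        open my $fh, '>:raw', $path or die "open '$path': $!";
        print $fh encode( 'UTF-8', $text );
        close $fh or die "close '$path': $!";
    }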