I’m implementing a network client that sends messages to a server. The messages are streams of bytes, and the protocol requires that I send the length of each stream beforehand.
If the message that I am given (by the code using my module) is a byte string, then the length is given easily enough by length $string. But if it’s a string of characters, I’ll need to massage it to get the raw bytes. What I’m doing now is basically this:
    my $msg = shift;    # some message from calling code
    my $bytes;
    if ( utf8::is_utf8( $msg ) ) {
        $bytes = Encode::encode( 'utf-8', $msg );
    } else {
        $bytes = $msg;
    }
    my $length = length $bytes;
Is this the correct way to handle this? It seems to work so far, but I haven’t done any serious testing yet. What potential pitfalls are there with this approach?
Thanks
You shouldn’t really be guessing at what your input is. Define your code to accept either byte strings or Unicode character strings, and leave it to the caller to convert the input to the proper format (or provide some way for the caller to specify which kind of strings they’re providing).
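As a rough sketch of what such a contract could look like, the module might expose one entry point per kind of input. The names frame_bytes and frame_text are hypothetical, and returning a (length, bytes) pair is only an assumption, since the actual wire format isn't given in the question:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical entry point: the caller promises $msg is a byte string.
sub frame_bytes {
    my ($msg) = @_;
    # A code point above 0xFF cannot be a single byte, so treat it as an error.
    die "frame_bytes: character above \\xFF in byte string\n"
        if $msg =~ /[^\x00-\xFF]/;
    return ( length $msg, $msg );
}

# Hypothetical entry point: the caller promises $msg is a character string.
sub frame_text {
    my ($msg) = @_;
    # Always encode, whatever Perl's internal representation happens to be.
    my $bytes = Encode::encode_utf8($msg);
    return ( length $bytes, $bytes );
}

# Four characters; the e-acute takes two bytes in UTF-8.
my ( $len, $bytes ) = frame_text("caf\x{E9}");
print "$len\n";    # prints 5
```

With this split, the decision about what the data is rests with the caller, who is the only one in a position to know.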
If you define your code to accept byte strings, then any characters above \xFF are an error.
If you define your code to accept Unicode character strings, then you can convert them to bytes with Encode::encode_utf8() (and should do so regardless of how they’re internally represented by Perl).
In any case, calling utf8::is_utf8() is usually a mistake: your program should not care about the internal representation of strings, only about the actual data (a sequence of characters) they contain. Whether some of those characters (in particular, those in the range \x80 to \xFF) are internally represented by one or two bytes should not matter.
P.S. Reading perldoc Encode may help to clarify issues with bytes and characters in Perl.
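To illustrate why branching on utf8::is_utf8() is fragile, here is a small sketch. The helper names guessed_length and encoded_length are made up for this example; guessed_length mirrors the approach in the question. It builds two strings that hold exactly the same characters but are stored differently inside Perl:

```perl
use strict;
use warnings;
use Encode ();

my $down = "caf\x{E9}";    # 4 characters, all <= 0xFF
my $up   = $down;
utf8::upgrade($up);        # same characters, different internal storage

# The approach from the question: decide based on the internal flag.
sub guessed_length {
    my ($msg) = @_;
    my $bytes = utf8::is_utf8($msg) ? Encode::encode_utf8($msg) : $msg;
    return length $bytes;
}

# Treat the input as characters and always encode.
sub encoded_length {
    my ($msg) = @_;
    return length Encode::encode_utf8($msg);
}

printf "equal as strings: %s\n", $down eq $up ? "yes" : "no";
printf "guessed: %d vs %d\n", guessed_length($down), guessed_length($up);
printf "encoded: %d vs %d\n", encoded_length($down), encoded_length($up);
```

The two strings compare equal, yet the flag-based version reports 4 for one and 5 for the other, while always encoding gives 5 for both.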