How can I handle utf8 using Perl (or Python) on the command line?
For example, I am trying to split each word into its characters. This is very easy for non-UTF-8 text:
$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c d e f
But with utf8 it doesn’t work, of course:
$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5> <D0> <B7> <D0> <B0>
because Perl reads the input as bytes by default, so each 2-byte UTF-8 character is split into its two separate bytes.
It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.
The “-C” flag controls some of the Perl Unicode features (see perldoc perlrun). For example, -CS makes Perl treat STDIN, STDOUT, and STDERR as UTF-8.
To specify the encoding used for stdin/stdout in Python, you could use the PYTHONIOENCODING environment variable.
If you’d like to split the text on character (grapheme) boundaries rather than on code points, then you could use the /\X/ regular expression. See Grapheme Cluster Boundaries.
In Python, \X is supported by the third-party regex module.
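As a rough illustration of the code-point approach in Python 3, here is a minimal sketch that uses only the standard library. The names split_code_points and split_stream are made up for this example and do not come from the original answer. The sketch decodes the raw bytes explicitly as UTF-8 and drops whitespace. It splits on code points, not grapheme clusters, so a combining mark would come out as a separate item.

```python
import io
from typing import BinaryIO, List, TextIO


def split_code_points(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Decode raw bytes and return the non-whitespace code points."""
    text = raw.decode(encoding)
    return [ch for ch in text if not ch.isspace()]


def split_stream(source: BinaryIO, sink: TextIO) -> None:
    """Write each input line to sink as space-separated code points."""
    for raw_line in source:
        sink.write(" ".join(split_code_points(raw_line)) + "\n")


# Demo on in-memory streams so the sketch does not depend on the terminal.
demo_in = io.BytesIO("одобрение за\n".encode("utf-8"))
demo_out = io.StringIO()
split_stream(demo_in, demo_out)
print(demo_out.getvalue(), end="")  # о д о б р е н и е з а
```

Ending a script with split_stream(sys.stdin.buffer, sys.stdout), after importing sys, would apply the same logic to piped input. The output encoding would then follow PYTHONIOENCODING or the locale.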