How can I handle utf8 using Perl (or Python) on the command line?

Question

Editorial Team

Asked: May 31, 2026 (2026-05-31T08:55:11+00:00)

How can I handle utf8 using Perl (or Python) on the command line?

For example, I am trying to split each word into its individual characters. This is very easy for non-UTF-8 text:

$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c   d e f

But with utf8 it doesn’t work, of course:

$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5>   <D0> <B7> <D0> <B0>

because Perl treats the input as raw bytes by default, so each 2-byte UTF-8 character is split into two separate bytes.

It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.


1 Answer

Editorial Team · Answer 1 · 2026-05-31T08:55:12+00:00

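The “-C” flag controls some of Perl’s Unicode features (see perldoc perlrun). Used without arguments, it makes the standard I/O handles UTF-8, provided the locale environment variables indicate a UTF-8 locale: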

$ echo "одобрение за" | perl -C -pe 's/.\K/ /g'
о д о б р е н и е   з а

In Python, you can set the PYTHONIOENCODING environment variable to specify the encoding used for stdin/stdout (the example uses Python 2 syntax):

$ echo "одобрение за" | PYTHONIOENCODING=utf-8 python -c'import sys
for line in sys.stdin:
    print " ".join(line.decode(sys.stdin.encoding)),
'
о д о б р е н и е   з а
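
For Python 3, where str is already decoded text and print is a function, the same idea can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name space_out_chars is hypothetical, the input is assumed to be UTF-8, and an in-memory byte stream stands in for piped stdin.

```python
import io


def space_out_chars(byte_stream, encoding="utf-8"):
    """Decode a byte stream and yield each line with its code points space-separated."""
    # Wrapping the binary stream makes the encoding explicit instead of
    # relying on the locale or on PYTHONIOENCODING.
    text_stream = io.TextIOWrapper(byte_stream, encoding=encoding)
    for line in text_stream:
        # Strip the newline first so no trailing space is added before it.
        yield " ".join(line.rstrip("\n"))


# Simulate piped input with an in-memory byte stream (assumed UTF-8).
sample = io.BytesIO("одобрение за\n".encode("utf-8"))
for spaced in space_out_chars(sample):
    print(spaced)  # prints: о д о б р е н и е   з а
```

In a real command-line script, passing sys.stdin.buffer instead of the in-memory sample would read the piped bytes directly.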

If you’d like to split the text on user-perceived character (grapheme cluster) boundaries rather than on code points, as the code above does, you could use the /\X/ regular expression:

$ echo "одобрение за" | perl -C -pe 's/\X\K/ /g'
о д о б р е н и е   з а

See “Grapheme Cluster Boundaries” in Unicode Standard Annex #29 (Unicode Text Segmentation).

In Python, \X is supported by the third-party regex module.
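
To illustrate the difference between code points and grapheme clusters without any third-party dependency, here is a rough sketch using only the standard unicodedata module. The helper name split_graphemes_approx is hypothetical, and the logic is only an approximation: it attaches combining marks to the preceding character and does not implement the full segmentation rules that \X follows.

```python
import unicodedata


def split_graphemes_approx(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: emoji sequences, Hangul jamo and other cases
    covered by the full grapheme cluster rules are not handled.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the combining mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Cyrillic "и" followed by COMBINING BREVE (U+0306), which renders like "й".
word = "и\u0306"
print(len(word))                          # prints: 2 (code points)
print(len(split_graphemes_approx(word)))  # prints: 1 (approximate cluster)
```

With the regex module installed, regex.findall(r"\X", text) performs the full grapheme cluster segmentation instead of this approximation.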
