I’m writing scripts to clean up unicode text files (stored as UTF-8), and I

Question


Asked: June 9, 2026

I’m writing scripts to clean up unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I’m doing something wrong, but it seems that print(), at least, still does not default to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:

print(label, msg)
... in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position 38: character maps to <undefined>

If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:

print(label, msg.encode())

This also works for printing objects or lists containing unicode strings–something I often have to do when debugging–since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj) ? If so, I suppose I could try to wrap it with my own function, but I’m not confident about handling all the argument permutations that print() supports.

Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)

Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*

However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())

b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'

I’m not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.


1 Answer

Editorial Team · Answer 1 · 2026-06-09T18:04:57+00:00


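print doesn’t default to any particular encoding; it writes to sys.stdout, which uses whatever encoding the output device (such as a console) claims to support. Your console encoding appears to be non-Unicode (the 'charmap' codec in the traceback suggests a legacy single-byte code page), so print tries to encode your Unicode strings in that encoding and fails. That is also why print(msg.encode()) shows a hard-to-read b'...' literal: encode() returns a bytes object, and print displays its repr. The easiest way around this is to tell the console to use UTF-8 (for example, export LC_ALL=en_US.UTF-8 on Unix systems).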
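If changing the console settings is not an option, one possible workaround is to handle the encoding in the script itself. The sketch below is only an illustration under assumptions: utf8_text_stream and escaped_for are hypothetical helper names, not part of the standard library, and it assumes the terminal can actually display UTF-8 bytes once they are written. Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is another way to get a similar effect without code changes.

```python
import io
import sys


def utf8_text_stream(binary_stream, errors="strict"):
    """Wrap a binary stream so text written to it is encoded as UTF-8.

    Hypothetical helper: pass sys.stdout.buffer to get a UTF-8 stdout.
    """
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", errors=errors, line_buffering=True
    )


def escaped_for(text, encoding):
    """Return text with characters the encoding cannot represent
    replaced by backslash escapes, so printing never raises.

    Hypothetical helper: keeps the output a readable str, not bytes.
    """
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


if __name__ == "__main__":
    # Show what encoding print() is currently using for stdout.
    encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    print("stdout encoding:", encoding)

    # Unencodable characters are shown as escapes instead of crashing.
    msg = "Applying regex 5 of 15: \u0968"
    print(escaped_for(msg, encoding))
```

To switch all later print() calls to UTF-8, the wrapper could be installed once near the top of the script with sys.stdout = utf8_text_stream(sys.stdout.buffer). The escaped_for variant is the more conservative choice when the console encoding cannot be changed: print(myobj) keeps working with all of its usual arguments, and only the characters the console cannot show are replaced.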
