I have a UTF-8 encoded file named test. In my simple example, it contains only the Danish letter “Ø”.
I also have a Python script that reads this file and, in this example, just prints every line.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import sys
filename = sys.argv[1]
f = codecs.open(filename, 'r', 'utf-8')
for lines in f:
    print lines
Call this parse.py. Now when I run ./parse.py test in my terminal I get the following output:
Ø
Calling instead ./parse.py test | less gives me:
Traceback (most recent call last):
  File "./test.py", line 12, in <module>
    print lines
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd8' in position 11: ordinal not in range(128)
I am certain my test file is ‘UTF-8’:
$ file -I test
test: text/plain; charset=utf-8
My $LC_CTYPE is also set to ‘UTF-8’.
What am I doing wrong? How do I get it to work, so that I can pass the output of parse.py to the next command?
This is probably a problem with less; see this article for some tips. Changing the configuration of less may help.
OK, this wasn’t the problem, so I am updating the answer based on the comments.
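You need to encode the string before printing it. This article gives the reason; in short, Python needs to be told how to encode the unicode. When the output goes to a terminal, Python 2 takes the output encoding from the locale, but when the output is piped, sys.stdout.encoding is None and print falls back to ASCII, which cannot represent “Ø”. Changing the print statement to print lines.encode('utf-8') avoids the error.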
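As a rough sketch of that fix, the script could encode each line itself and write the resulting bytes, so the result no longer depends on whether stdout is a terminal or a pipe. The helper names write_encoded and print_file are made up for this illustration and are not part of the original script; the code assumes the input file is UTF-8 and that UTF-8 is also the desired output encoding.

```python
# -*- coding: utf-8 -*-
import io
import sys


def write_encoded(lines, out, encoding="utf-8"):
    """Encode each unicode line explicitly and write the bytes to out.

    write_encoded is a hypothetical helper name used for illustration.
    """
    for line in lines:
        out.write(line.encode(encoding))


def print_file(path, out=None, encoding="utf-8"):
    """Decode the file at path and re-emit it as encoded bytes.

    print_file is also a hypothetical name. If out is not given, the
    bytes go to stdout: sys.stdout.buffer on Python 3, or sys.stdout
    itself on Python 2, where it already accepts byte strings.
    """
    if out is None:
        out = getattr(sys.stdout, "buffer", sys.stdout)
    with io.open(path, "r", encoding=encoding) as f:
        write_encoded(f, out, encoding)
```

With these definitions, ending parse.py with print_file(sys.argv[1]) should let ./parse.py test | less run without the UnicodeEncodeError. Another option that needs no code change is to set the PYTHONIOENCODING environment variable, for example PYTHONIOENCODING=utf-8 ./parse.py test | less.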