I have the following function: def storeTaggedCorpus(corpus, filename): corpusFile = codecs.open(filename, mode = ‘w’,

Asked: May 26, 2026 (2026-05-26T15:47:47+00:00)

I have the following function:

def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        tagged_token = tagged_token.decode('ISO-8859-1')
        tagged_token = tagged_token.encode('utf-8')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()

When I execute it, I get the following error:

(...) in storeTaggedCorpus
    corpusFile.write(tagged_token)
  File "c:\Python26\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "c:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

So I debugged it and discovered that the created file was encoded as ANSI, not UTF-8 as declared in corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8'). If the corpusFile.write(tagged_token) call is removed, the function will (obviously) run, and the file will be encoded as ANSI. If I instead remove tagged_token = tagged_token.encode('utf-8'), it also runs, BUT the resulting file has the encoding "ANSI as UTF-8" (???) and the Latin characters are mangled. Since I’m analyzing pt-BR text, this is unacceptable.

I believe that everything would work fine if corpusFile were opened as UTF-8, but I can’t get it to work. I’ve searched the Web, but everything I found about Python/Unicode dealt with something else. So why does this file always end up in ANSI? I am using Python 2.6 on Windows 7 x64, and the file encodings were reported by Notepad++.

Edit — About the `corpus` parameter

I don’t know the encoding of the strings in corpus. It was generated by the PlaintextCorpusReader.tag() method from NLTK. The original corpus file was encoded in UTF-8, according to Notepad++. The tagged_token.decode('ISO-8859-1') call is just a guess. I’ve also tried decoding it as cp1252 and got the same mangled characters as with ISO-8859-1.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T15:47:48+00:00

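When you open the file with codecs.open(filename, 'w', encoding='utf-8'), there is no point in writing byte strings (str objects) into the file. The writer expects unicode objects; if it is handed a str, Python 2 first decodes it implicitly with the ASCII codec, and that is what fails on byte 0xc3 and raises the UnicodeDecodeError. Instead, write unicode objects, like this: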

corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token)
corpusFile.write(u'\n')

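Note that codecs.open always opens the underlying file in binary mode when an encoding is given, so no newline translation takes place: u'\n' is written as a single LF byte on every platform.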
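A fuller sketch of this first approach, applied to the function from the question. It rests on assumptions the question does not confirm: that each token is a tuple of strings, and that any byte strings in it are UTF-8 (the original corpus file was reported as UTF-8). The names store_tagged_corpus, to_text and source_encoding are made up for illustration, and io.open is swapped in for codecs.open because it rejects byte strings outright instead of decoding them implicitly.

```python
import io


def to_text(value, source_encoding):
    # Decode byte strings exactly once; unicode text passes through unchanged.
    if isinstance(value, bytes):
        return value.decode(source_encoding)
    return value


def store_tagged_corpus(corpus, filename, source_encoding='utf-8'):
    # source_encoding is an assumption about the input, not a known fact.
    # The file object encodes to UTF-8 on write; newline='\n' writes LF as-is.
    with io.open(filename, mode='w', encoding='utf-8', newline='\n') as corpus_file:
        for token in corpus:
            parts = [to_text(part, source_encoding) for part in token]
            corpus_file.write(u'/'.join(parts))
            corpus_file.write(u'\n')
```

Calling it with source_encoding='ISO-8859-1' on UTF-8 input would reproduce the mangled characters from the question, because every UTF-8 byte is then read as a separate Latin-1 character before being encoded again.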

Alternatively, open a binary file and write byte arrays of already-encoded strings:

corpusFile = open(filename, mode = 'wb')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token.encode('utf-8'))
corpusFile.write('\n')

This will write platform-independent EOLs, because a file opened in binary mode stores '\n' unchanged. If you want a platform-dependent EOL, write os.linesep instead of '\n'.
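The binary-mode variant could be sketched as follows, with the platform's line separator written explicitly. As before, the function name and the source_encoding parameter are hypothetical, and the input is assumed to be tuples of strings whose byte strings are UTF-8.

```python
import os


def store_tagged_corpus_binary(corpus, filename, source_encoding='utf-8'):
    # os.linesep is '\r\n' on Windows and '\n' on POSIX systems.
    eol = os.linesep.encode('ascii')
    with open(filename, 'wb') as corpus_file:
        for token in corpus:
            # Decode byte strings once (assumed encoding), keep text as-is.
            parts = [part.decode(source_encoding) if isinstance(part, bytes) else part
                     for part in token]
            line = u'/'.join(parts)
            # In binary mode every write must be bytes that are already encoded.
            corpus_file.write(line.encode('utf-8'))
            corpus_file.write(eol)
```

The trade-off is that all encoding is done by hand here, so a forgotten encode or a second encode shows up only in the output bytes rather than as an exception.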

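Note that the encoding naming in Notepad++ is misleading: "ANSI as UTF-8" is its label for UTF-8 without a byte order mark (BOM), which is what you want. If the characters still look mangled in such a file, the likely cause is the decode step: bytes that are already UTF-8 (as the original corpus file was reported to be) should be decoded with 'utf-8', not 'ISO-8859-1'.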
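Rather than relying on an editor's label, the bytes on disk can be inspected directly. The helper below is a rough heuristic written for illustration (describe_encoding is a hypothetical name, and this is not how Notepad++ itself detects encodings). It also suggests why an empty or ASCII-only file may be reported as ANSI: such bytes are valid in both encodings, so nothing distinguishes them.

```python
import codecs


def describe_encoding(filename):
    """Return a rough label for how the bytes of a file can be read."""
    with open(filename, 'rb') as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        return 'UTF-8 with BOM'
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        return 'not valid UTF-8'
    # Equal lengths mean every character took one byte, i.e. pure ASCII.
    if len(text) == len(data):
        return 'ASCII only'
    return 'UTF-8 without BOM'
```

A file written correctly from pt-BR text should come back as 'UTF-8 without BOM'; a result of 'not valid UTF-8' would point to bytes in a legacy encoding such as cp1252.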
