I have text files that use UTF-8 encoding and contain characters like ‘ö’, ‘ü’, etc. I would like to parse the text from these files, but I can’t get the tokenizer to work properly. If I use the standard NLTK tokenizer:
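import nltk
f = open(r'C:\Python26\text.txt', 'r')  # raw string so '\t' is not read as a tab; text = 'müsli pöök rääk'
text = f.read()
f.close()  # the parentheses are needed; a bare f.close does not close the file
items = text.decode('utf8')  # Python 2: byte string -> unicode
a = nltk.word_tokenize(items)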
Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k']
The Punkt tokenizer seems to do better:
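from nltk.tokenize.punkt import PunktWordTokenizer
f = open(r'C:\Python26\text.txt', 'r')  # raw string so '\t' is not read as a tab; text = 'müsli pöök rääk'
text = f.read()
f.close()  # the parentheses are needed; a bare f.close does not close the file
items = text.decode('utf8')  # Python 2: byte string -> unicode
a = PunktWordTokenizer().tokenize(items)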
Output: [u'\ufeffm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']
There is still a ‘\ufeff’ before the first token that I can’t figure out (not that I can’t remove it). What am I doing wrong? Help greatly appreciated.
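It’s more likely that the \uFEFF char is part of the content read from the file. I doubt it was inserted by the tokeniser.
\uFEFF at the beginning of a file is a Byte Order Mark (BOM), which the Unicode standard neither requires nor recommends for UTF-8. If it appears anywhere else, then it is treated as a zero width non-break space.
Was the file written by Microsoft Notepad? According to the codecs module docs, Notepad writes a UTF-8 encoded BOM (the byte sequence 0xEF, 0xBB, 0xBF) at the start of the file, a variant of UTF-8 that Python calls "utf-8-sig".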
Try reading your file using codecs.open() instead. Note the "utf-8-sig" encoding, which consumes the BOM. Experiment:
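A minimal sketch of such an experiment, not the exact session from this answer. It assumes a throwaway temporary file standing in for text.txt, written the way Notepad would write it (BOM first), and uses a plain whitespace split instead of an NLTK tokenizer so that it runs without NLTK installed. The names sample, path, with_bom and without_bom are made up for illustration.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical sample file: UTF-8 bytes preceded by the three-byte BOM,
# which is what Notepad writes when saving as UTF-8.
sample = u'm\xfcsli p\xf6\xf6k r\xe4\xe4k'  # 'müsli pöök rääk'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + sample.encode('utf-8'))

# Plain 'utf-8' decodes the BOM into a leading U+FEFF character.
with codecs.open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read()

# 'utf-8-sig' consumes the BOM while decoding.
with codecs.open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom[0]))      # the stray U+FEFF character
print(without_bom == sample)  # True
print(without_bom.split())    # three whole words, no BOM on the first one
```

If the file is read this way, the decoded text can be passed to the tokenizer as before and the first token should no longer start with \ufeff.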