I’ve been working on a statistical machine translation system for Haiti (code.google.com/p/ccmts). It uses a C++ backend, Moses (http://www.statmt.org/moses/?n=Development.GetStarted), which is driven from Python.
I’ve passed a UTF-8 Python string into a C++ std::string, done some processing, and gotten a result back into Python. Here is the string (when printed from C++ to a Linux terminal):
mwen bezwen ã ¨ d medikal
- What encoding is that? Is it a double encoded string?
- How do I “fix it” so it’s renderable?
- Is that printed in that fashion because I’m missing a font or something?
The Python chardet library says:
{'confidence': 0.93812499999999999, 'encoding': 'utf-8'}
but Python, when I run a string/unicode/codecs decode, gives me the old:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 30: ordinal not in range(128)
Oh, and Python prints that exact same string to standard output.
A repr() call prints the following: ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
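For what it’s worth, the characters in that repr() output can be identified by decoding the bytes as UTF-8 and asking the standard library’s unicodedata module what each non-ASCII character is:

```python
import unicodedata

# The bytes shown by the repr() call above, decoded as UTF-8.
raw = b'mwen bezwen \xc3\xa3 \xc2\xa8 d medikal'
text = raw.decode('utf-8')

for ch in text:
    if ord(ch) > 127:
        print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))
# U+00E3 LATIN SMALL LETTER A WITH TILDE
# U+00A8 DIAERESIS
```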
It looks like a case of garbage in, garbage out. Here are a few clues on how to see what you’ve got in your data.
repr() and unicodedata.name() are your friends.

Update:

If (as A. N. Other implies) you are letting the package choose the output language at random, and you suspect its choice is e.g. Korean: (a) tell us, (b) try to decode the output using a codec that’s relevant to that language. Here are attempts with not only a Korean codec but also two each for Chinese, Japanese, and Russian:
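The attempts were along these lines (a sketch; the exact codec list here is an assumption):

```python
# Try decoding the mystery bytes with codecs for a few candidate
# languages (this codec list is illustrative, not exhaustive).
mystery = b'\xc3\xa3 \xc2\xa8'   # the non-ASCII part of the output

for codec in ('euc-kr', 'gb2312', 'big5', 'shift_jis', 'euc_jp',
              'koi8-r', 'cp1251'):
    try:
        print('%-9s -> %r' % (codec, mystery.decode(codec)))
    except UnicodeDecodeError as exc:
        print('%-9s -> undecodable: %s' % (codec, exc))
```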
None very plausible, really, especially the koi8-r. Further suggestions:

- Inspect the documentation of the package you are interfacing with (URL, please!): what does it say about encoding?
- Between which two languages are you translating?
- Does “mwen bezwen” make any sense in the expected output language?
- Try a much larger sample of text: does chardet still indicate UTF-8? Does any of the larger output make sense in the expected output language?
- Try translating English to another language that uses only ASCII: do you get meaningful ASCII output?
- Do you care to divulge your Python code and your SWIG interface code?
Update 2:

The information flow is interesting: “a string processing app” -> “a statistical language translation system” -> “a machine translation system (open source/free software) to help out in Haiti (crisiscommons.org)”.
Please try to replace “unknown” by the facts in the following:
Test 2 obtained from both Google Translate (alpha) and
Microsoft Translate (beta):
Mwen bezwen èd medikal.

The third word is LATIN SMALL LETTER E WITH GRAVE (U+00E8) followed by ‘d’.
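So ‘è’ (U+00E8) came out as “ã ¨”: the signature of a double encoding, i.e. UTF-8 bytes misread as Latin-1 and then re-encoded. A sketch that reproduces the observed bytes exactly, assuming the pipeline also lowercases its input somewhere (Moses-style pipelines commonly lowercase during training, but that step is an assumption here):

```python
original = '\u00e8'                        # 'è', as in 'èd'
utf8_bytes = original.encode('utf-8')      # b'\xc3\xa8'
misdecoded = utf8_bytes.decode('latin-1')  # 'Ã¨'  (wrong codec applied)
lowered = misdecoded.lower()               # 'ã¨'  (assumed lowercasing step)
reencoded = lowered.encode('utf-8')
print(reencoded)                           # b'\xc3\xa3\xc2\xa8' -- the repr() bytes
```

Note that once the text has been lowercased the damage is not mechanically reversible: 'ã¨'.encode('latin-1') gives b'\xe3\xa8', which is not valid UTF-8. The encoding has to be fixed upstream.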
Update 3
You said “””input: utf8 (maybe, i think a couple of my files might have improperly coded text in them) “””
Assuming (you’ve never stated this explicitly) that all your files should be encoded in UTF-8:
The zip file of aligned en-fr-ht corpus has several files that crash when one attempts to decode them as UTF-8.
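Finding the offending files is straightforward (a sketch; 'corpus' is a placeholder for wherever the unzipped files live):

```python
import os

def find_non_utf8(root):
    """Yield (path, error) for every file under root that fails strict UTF-8 decoding."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                data = f.read()
            try:
                data.decode('utf-8', 'strict')
            except UnicodeDecodeError as exc:
                yield path, exc

for path, err in find_non_utf8('corpus'):   # placeholder directory name
    print(path, err)
```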
Diagnosis of why this happens:
chardet is useless (in this case); it faffs about for a long time and comes back with a guess of ISO-8859-2 (Eastern Europe aka Latin2) with a confidence level of 80 to 90 pct.
Next step: chose the ht-en directory (ht uses fewer accented chars than fr therefore easier to see what is going on).
Expectation: e-grave is the most frequent non-ASCII character in presumed-good ht text (a web site, CMU files) … about 3 times as many as the next one, o-grave. The 3rd most frequent one is lost in the noise.
Got counts of non-ASCII bytes in file hten.txt. Top 5:
The last three rows are explained by
The first 2 rows are explained by
Explanations that include latin1 or cp1252 don’t hold water (8a is a control character in latin1; 8a is S-caron in cp1252).
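For reference, byte counts like those above can be produced with a few lines of Python (a sketch):

```python
from collections import Counter

def top_non_ascii(path, n=5):
    """Count non-ASCII byte values in a file; return the n most common."""
    with open(path, 'rb') as f:
        data = f.read()
    counts = Counter(b for b in data if b > 127)   # Python 3: bytes iterate as ints
    return counts.most_common(n)

# Example: for byte, count in top_non_ascii('hten.txt'):
#              print('0x%02x %d' % (byte, count))
```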
Inspection of the contents reveals that the file is a conglomeration of multiple original files, some UTF-8, at least one cp850 (or similar). The culprit appears to be the Bible!!!
The mixture of encodings explains why chardet was struggling.
Suggestions:
(1) Implement checking of encoding on all input files. Ensure that they are converted to UTF-8 right up front, like at border control.
(2) Implement a script to check UTF-8 decodability before release.
(3) The orthography of the Bible text appears (at a glance) to be different from that of the websites (many more apostrophes). You may wish to discuss with your Creole experts whether your corpus is being distorted by a different orthography. There is also the question of vocabulary: do you expect to get much use out of unleavened bread and sackcloth & ashes? Note that the cp850 stuff appears to be about 90% of the conglomeration; some Bible might be OK, but 90% seems over the top.
(4) Why is Moses not complaining about non-UTF-8 input? Possibilities: (a) it works on raw bytes, i.e. it doesn’t convert to Unicode at all; (b) it attempts to convert to Unicode but silently ignores failure 🙁
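Suggestion (1) can be as simple as a strict transcoding pass at the border (a sketch; using cp850 as the fallback codec is an assumption based on the diagnosis above):

```python
def to_utf8(data, fallback='cp850'):
    """'Border control' for input bytes.

    Input that already decodes as strict UTF-8 is passed through
    untouched; anything else is assumed to be in the fallback codec
    (cp850 here, per the diagnosis above) and transcoded to UTF-8.
    """
    try:
        data.decode('utf-8', 'strict')
        return data
    except UnicodeDecodeError:
        return data.decode(fallback).encode('utf-8')

print(to_utf8(b'\x8a'))   # b'\xc3\xa8' -- cp850 0x8a is e-grave, re-encoded as UTF-8
```

As a cross-check, cp850 maps byte 0x8a to è (U+00E8), exactly the character expected to dominate good Haitian Creole text.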