The Archive Base


Editorial Team
Asked: May 13, 2026 (2026-05-13T19:35:17+00:00)


I’ve been working on a statistical translation system for Haiti (code.google.com/p/ccmts) that uses a C++ backend (http://www.statmt.org/moses/?n=Development.GetStarted), with Python driving the C++ engine.

I’ve passed a UTF-8 Python string into a C++ std::string, done some processing, gotten a result back into Python and here is the string (when printed from C++ into a Linux terminal):

mwen bezwen ã ¨ d medikal

  1. What encoding is that? Is it a double encoded string?
  2. How do I “fix it” so it’s renderable?
  3. Is that printed in that fashion because I’m missing a font or something?

The Python chardet library says:

{'confidence': 0.93812499999999999, 'encoding': 'utf-8'}

but when I try to decode it in Python (string/unicode/codecs decode), I get the familiar:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 30: ordinal not in range(128)

Oh, and Python prints that exact same string to standard output.

A repr() call prints the following: ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 7:35 pm (2026-05-13T19:35:18+00:00)

    It looks like a case of garbage in, garbage out. Here are a few clues on how to see what you’ve got in your data. repr() and unicodedata.name() are your friends.

    >>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
    >>> print repr(s.decode('utf8'))
    u' mwen bezwen \xe3 \xa8 d medikal '
    >>> import unicodedata
    >>> unicodedata.name(u'\xe3')
    'LATIN SMALL LETTER A WITH TILDE'
    >>> unicodedata.name(u'\xa8')
    'DIAERESIS'
    >>>
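
The session above is Python 2. As a minimal sketch of the same inspection in Python 3 (where the data has to be a bytes object before it can be decoded), something like the following should work. The helper name `describe_non_ascii` is made up for this illustration and is not part of any library.

```python
import unicodedata


def describe_non_ascii(raw, encoding="utf-8"):
    """Decode raw bytes and list (char, code point, Unicode name) for non-ASCII characters."""
    text = raw.decode(encoding)
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# The bytes reported by repr() in the question.
s = b' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
for ch, code_point, name in describe_non_ascii(s):
    print(code_point, name)
```

This reports the same two characters as the session above: an a-with-tilde and a stand-alone diaeresis.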
    

    Update:

    If (as A. N. Other implies) you are letting the package choose the output language at random, and you suspect its choice is e.g. Korean, then (a) tell us and (b) try to decode the output using a codec that’s relevant to that language. Here are not only Korean but also two each of Chinese, Japanese, and Russian:

    >>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
    >>> for enc in 'euc-kr big5 gb2312 shift-jis euc-jp cp1251 koi8-r'.split():
        print enc, s.decode(enc)
    
    
    euc-kr  mwen bezwen 찾 짢 d medikal 
    big5  mwen bezwen 瓊 穡 d medikal 
    gb2312  mwen bezwen 茫 篓 d medikal 
    shift-jis  mwen bezwen テ」 ツィ d medikal 
    euc-jp  mwen bezwen 達 即 d medikal 
    cp1251  mwen bezwen ГЈ ВЁ d medikal 
    koi8-r  mwen bezwen цё б╗ d medikal 
    >>> 
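
For readers on Python 3, here is a rough equivalent of the loop above. It is only a sketch: the function name `decode_with_each` is invented for this example, and `errors='replace'` is an added assumption so that a codec that cannot decode a byte yields U+FFFD instead of raising.

```python
s = b' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
candidates = 'euc-kr big5 gb2312 shift-jis euc-jp cp1251 koi8-r'.split()


def decode_with_each(raw, encodings):
    """Return {encoding: decoded text}; undecodable bytes become U+FFFD."""
    return {enc: raw.decode(enc, errors='replace') for enc in encodings}


results = decode_with_each(s, candidates)
for enc, text in results.items():
    print(enc, text)
```

The point is the same as in the original session: eyeball each decoding and ask whether any of them looks like text in the language you expected.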
    

    None very plausible, really, especially the koi8-r. Further suggestions: Inspect the documentation of the package you are interfacing with (URL please!): what does it say about encoding? Between which two languages are you trying it? Does “mwen bezwen” make any sense in the expected output language? Try a much larger sample of text: does chardet still indicate UTF-8? Does any of the larger output make sense in the expected output language? Try translating English to another language that uses only ASCII: do you get meaningful ASCII output? Do you care to divulge your Python code and your SWIG interface code?

    Update 2

    The information flow is interesting: “a string processing app” -> “a statistical language translation system” -> “a machine translation system (opensource/freesoftware) to help out in haiti (crisiscommons.org)”

    Please try to replace “unknown” by the facts in the following:

    Input language: English (guess)
    Output language: Haitian Creole
    Operating system: linux
    Python version: unknown
    C++ package name: unknown
    C++ package URL: unknown
    C++ package output encoding: unknown
    
    Test 1 input: unknown
    Test 1 expected output: unknown
    Test 1 actual output (utf8): ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
    [Are all of those internal spaces really in the string?]
    
    Test 2 input: 'I need medical aid.'
    Test 2 expected output (utf8): 'Mwen bezwen \xc3\xa8d medikal.'
    Test 2 actual output (utf8): unknown
    

    Test 2 expected output was obtained from both Google Translate (alpha) and
    Microsoft Translate (beta):
    Mwen bezwen èd medikal.
    The third word is LATIN SMALL LETTER E WITH GRAVE (U+00E8) followed by ‘d’.

    Update 3

    You said “””input: utf8 (maybe, i think a couple of my files might have improperly coded text in them) “””

    Assuming (you’ve never stated this explicitly) that all your files should be encoded in UTF-8:

    The zip file of the aligned en-fr-ht corpus contains several files that raise an error when one attempts to decode them as UTF-8.

    Diagnosis of why this happens:

    chardet is useless (in this case); it faffs about for a long time and comes back with a guess of ISO-8859-2 (Eastern Europe aka Latin2) with a confidence level of 80 to 90 pct.

    Next step: chose the ht-en directory (ht uses fewer accented chars than fr therefore easier to see what is going on).

    Expectation: e-grave is the most frequent non-ASCII character in presumed-good ht text (a web site, CMU files) … about 3 times as many as the next one, o-grave. The 3rd most frequent one is lost in the noise.

    Got counts of non-ascii bytes in file hten.txt. Top 5:

    8a 99164
    95 27682
    c3 8210
    a8 6004
    b2 2159
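
A count like the one above can be produced with a few lines of Python. This is a sketch in Python 3, not the script actually used for the table; `top_non_ascii_bytes` is a hypothetical helper name, and the sample bytes are a tiny in-memory stand-in for the real file.

```python
from collections import Counter


def top_non_ascii_bytes(data, n=5):
    """Return the n most common byte values >= 0x80 as (hex string, count) pairs."""
    counts = Counter(b for b in data if b >= 0x80)
    return [("%02x" % value, count) for value, count in counts.most_common(n)]


# Stand-in data: some UTF-8 text followed by some cp850 text.
sample = "èd èd ò".encode("utf-8") + "è è è ò".encode("cp850")
for hex_byte, count in top_non_ascii_bytes(sample):
    print(hex_byte, count)
```

For a real file, read it in binary mode first, e.g. `data = open('hten.txt', 'rb').read()`, and pass `data` to the function.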
    

    The last three rows are explained by

    e-grave is c3 a8 in UTF-8
    o-grave is c3 b2 in UTF-8
    2159 + 6004 approx == 8210
    6004 approx == 3 * 2159
    

    The first 2 rows are explained by

    e-grave is 8a in old Western Europe DOS encodings like cp850!!
    o-grave is 95 in old Western Europe DOS encodings like cp850!!
    99164 approx == 3 * 27682
    

    Explanations that include latin1 or cp1252 don’t hold water (8a is a control character in latin1; 8a is S-caron in cp1252).
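
The byte values quoted above are easy to check with Python's built-in codecs. A small Python 3 sketch (the constant names are just for this example):

```python
import unicodedata

E_GRAVE = "\u00e8"  # è
O_GRAVE = "\u00f2"  # ò

# How the two most frequent accented letters are encoded.
for codec in ("utf-8", "cp850"):
    print(codec, E_GRAVE.encode(codec).hex(), O_GRAVE.encode(codec).hex())

# How the byte 0x8a reads under the two rejected explanations.
for codec in ("latin-1", "cp1252"):
    ch = b"\x8a".decode(codec)
    print(codec, "U+%04X" % ord(ch), unicodedata.name(ch, "<control character>"))
```

The second loop passes a default to `unicodedata.name()` because control characters have no name and would otherwise raise `ValueError`.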

    Inspection of the contents reveals that the file is a conglomeration of multiple original files, some UTF-8, at least one cp850 (or similar). The culprit appears to be the Bible!!!

    The mixture of encodings explains why chardet was struggling.
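
Given such a mixture, one possible repair is to decode line by line: accept a line as UTF-8 if it decodes cleanly, otherwise fall back to cp850. The sketch below is Python 3 and rests on assumptions that would need checking against the real corpus: each line is entirely in one encoding, and cp850 is the only other encoding present (the diagnosis above only says "cp850 (or similar)"). Both function names are invented for this example.

```python
def decode_mixed_line(raw_line, fallback="cp850"):
    """Decode as UTF-8 if the bytes are valid UTF-8, otherwise use the fallback codec."""
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return raw_line.decode(fallback)


def to_utf8(raw, fallback="cp850"):
    """Re-encode a mixed UTF-8/fallback byte string as UTF-8, one line at a time."""
    lines = raw.split(b"\n")
    return "\n".join(decode_mixed_line(line, fallback) for line in lines).encode("utf-8")


# Stand-in data: the same line once in UTF-8 and once in cp850.
mixed = "mwen bezwen èd\n".encode("utf-8") + "mwen bezwen èd\n".encode("cp850")
print(to_utf8(mixed).decode("utf-8"))
```

A cp850 line can occasionally happen to be valid UTF-8, so the output of a heuristic like this still deserves a spot check by someone who reads the language.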

    Suggestions:

    (1) Implement checking of encoding on all input files. Ensure that they are converted to UTF-8 right up front, like at border control.

    (2) Implement a script to check UTF-8 decodability before release.

    (3) The orthography of the Bible text appears (at a glance) to be different to that of websites (many more apostrophes). You may wish to discuss with your Creole experts whether your corpus is being distorted by a different orthography … there is also the question of the words; do you expect to get much use of unleavened bread and sackcloth & ashes? Note the cp850 stuff appears to be about 90% of the conglomeration; some Bible might be OK but 90% seems over the top.

    (4) Why is Moses not complaining about non-UTF-8 input? Possibilities: (a) it is working on raw bytes, i.e. it doesn’t convert to Unicode; (b) it attempts to convert to Unicode, but silently ignores failure 🙁
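
Suggestion (2) could be as small as the following Python 3 sketch. It is one way to do it, not an existing tool: `utf8_errors` and `check_files` are hypothetical names, and reporting by line number assumes the files are line-oriented text.

```python
def utf8_errors(raw):
    """Return (line number, error message) for each line that is not valid UTF-8."""
    problems = []
    for lineno, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append((lineno, str(exc)))
    return problems


def check_files(paths):
    """Print every undecodable line and return the number of files with problems."""
    bad_files = 0
    for path in paths:
        with open(path, "rb") as handle:
            problems = utf8_errors(handle.read())
        if problems:
            bad_files += 1
        for lineno, message in problems:
            print("%s:%d: %s" % (path, lineno, message))
    return bad_files


# Stand-in data: line 1 is UTF-8, line 2 is cp850.
good = "èd".encode("utf-8")
bad = "èd".encode("cp850")
print(utf8_errors(good + b"\n" + bad))
```

In a release script, a non-zero return value from `check_files` would be the signal to stop and fix the input before it reaches the translation engine.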

© 2021 The Archive Base. All Rights Reserved