The Archive Base

Editorial Team
Asked: May 13, 2026

I am trying to parse a CSV file containing some data, mostly numeral but


I am trying to parse a CSV file containing some data, mostly numeric but with some strings whose encoding I do not know. I do know they are in Hebrew.

Eventually I need to know the encoding so I can convert the strings to unicode, print them, and perhaps store them in a database later on.

I tried using Chardet, which claims the strings are Windows-1255 (cp1255), but print someString.decode('cp1255') yields the notorious error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

I tried every other encoding possible, to no avail. Also, the file is absolutely valid since I can open the CSV in Excel and I see the correct data.

Any idea how I can properly decode these strings?


EDIT: here is an example. One of the strings looks like this (first five letters of the Hebrew alphabet):

print repr(sampleString)
#prints:
'\xe0\xe1\xe2\xe3\xe4'

(using Python 2.6.2)


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 9:37 pm

    This is what’s happening:

    • sampleString is a byte string (cp1255 encoded).
    • sampleString.decode("cp1255") decodes the byte string to a unicode string (decode: bytes -> unicode string).
    • print sampleString.decode("cp1255") attempts to print the unicode string to stdout. To do that, print has to encode the unicode string (encode: unicode string -> bytes) using sys.stdout.encoding, which is the terminal's encoding. The error you're seeing means that the print statement cannot represent the given unicode string in the console's encoding.
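
The decode and encode steps listed above can be sketched as follows. This is a minimal illustration that assumes Python 3 (the question uses Python 2.6), where the byte string type is bytes and the unicode type is str. The variable names are illustrative only.

```python
# Assumes Python 3: bytes is the byte-string type, str is the unicode type.
sample_bytes = b'\xe0\xe1\xe2\xe3\xe4'  # the cp1255-encoded sample from the question

# decode: bytes -> unicode string
text = sample_bytes.decode("cp1255")

# encode: unicode string -> bytes, in whatever encoding the output expects
utf8_bytes = text.encode("utf-8")

# Encoding to a charset that has no Hebrew letters fails, as in the question.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Here text holds the five Hebrew letters as code points U+05D0 to U+05D4, and ascii_ok ends up False because ASCII cannot represent them.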

    So the problem is that your console does not support these characters. You should be able to configure the console to use another encoding. The details of how to do that depend on your OS and terminal program.

    Another approach would be to manually specify the encoding to use:

    print sampleString.decode("cp1255").encode("utf-8")
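
For readers on Python 3, the same idea (choosing the output encoding yourself rather than relying on the console's) might look like the sketch below. It assumes Python 3, and write_encoded is a hypothetical helper name, not part of any library. In a real script the target stream would typically be sys.stdout.buffer, the binary layer under sys.stdout; an in-memory io.BytesIO stands in here so the written bytes can be inspected.

```python
import io

def write_encoded(text, stream, encoding="utf-8"):
    """Hypothetical helper: encode text explicitly and write it to a binary stream."""
    stream.write(text.encode(encoding) + b"\n")

text = b'\xe0\xe1\xe2\xe3\xe4'.decode("cp1255")

# A BytesIO stands in for sys.stdout.buffer so the bytes can be checked.
captured = io.BytesIO()
write_encoded(text, captured)
```

Because the encoding happens explicitly, the console's default encoding never gets a chance to raise UnicodeEncodeError. Whether the terminal then displays the bytes correctly still depends on the terminal itself.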
    

    See also:

    • http://wiki.python.org/moin/PrintFails
    • Setting the correct encoding when piping stdout in Python

    A simple test program you can experiment with:

    import sys
    print sys.stdout.encoding
    samplestring = '\xe0\xe1\xe2\xe3\xe4'
    print samplestring.decode("cp1255").encode(sys.argv[1])
    

    On my utf-8 terminal:

    $ python2.6 test.py utf-8
    UTF-8
    אבגדה
    
    $ python2.6 test.py latin1
    UTF-8
    Traceback (most recent call last):
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
    
    $ python2.6 test.py ascii
    UTF-8
    Traceback (most recent call last):
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
    
    $ python2.6 test.py cp424
    UTF-8
    ABCDE
    
    $ python2.6 test.py iso8859_8
    UTF-8
    �����
    

    The error messages for latin-1 and ascii mean that the unicode characters in the string cannot be represented in these encodings.

    Notice the last two. I encode the unicode string to the cp424 and iso8859_8 encodings (two of the encodings listed on http://docs.python.org/library/codecs.html#standard-encodings that support Hebrew characters). I get no exception using these encodings, since the Hebrew unicode characters have a representation in both of them.

    But my utf-8 terminal gets very confused when it receives bytes in a different encoding than utf-8.

    In the first case (cp424), my UTF-8 terminal displays ABCDE, meaning that the utf-8 representation of A corresponds to the cp424 representation of א, i.e. the byte value 65 means A in utf-8 and א in cp424. The remaining four letters map to B through E in the same way.
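
The byte-level view of those last two runs can be checked directly. This is a small sketch assuming Python 3. It only inspects the bytes the encoders produce and how a UTF-8 reader would interpret them; the variable names are illustrative.

```python
text = b'\xe0\xe1\xe2\xe3\xe4'.decode("cp1255")

cp424_bytes = text.encode("cp424")
iso_bytes = text.encode("iso8859_8")

# The cp424 bytes fall in the ASCII range, so a UTF-8 terminal shows Latin letters.
as_seen_by_terminal = cp424_bytes.decode("ascii")

# The iso8859_8 bytes are not valid UTF-8; errors="replace" substitutes
# U+FFFD (the replacement character) for the undecodable input.
garbled = iso_bytes.decode("utf-8", errors="replace")
```

Here cp424_bytes is b'ABCDE', while iso_bytes is the same byte sequence as the original cp1255 sample, since both encodings place these Hebrew letters at the same byte values.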

    The encode method has an optional string argument you can use to specify what should happen when the encoding cannot represent a character (documentation). The supported strategies are strict (the default), ignore, replace, xmlcharrefreplace and backslashreplace. You can even add your own custom strategies.

    Another test program (I print with quotes around the string to better show how ignore behaves):

    import sys
    samplestring = '\xe0\xe1\xe2\xe3\xe4'
    print "'{0}'".format(samplestring.decode("cp1255").encode(sys.argv[1], 
          sys.argv[2]))
    

    The results:

    $ python2.6 test.py latin1 strict
    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        sys.argv[2]))
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
    [/tmp]
    $ python2.6 test.py latin1 ignore
    ''
    [/tmp]
    $ python2.6 test.py latin1 replace
    '?????'
    [/tmp]
    $ python2.6 test.py latin1 xmlcharrefreplace
    '&#1488;&#1489;&#1490;&#1491;&#1492;'
    [/tmp]
    $ python2.6 test.py latin1 backslashreplace
    '\u05d0\u05d1\u05d2\u05d3\u05d4'
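
As for custom strategies, a minimal sketch using codecs.register_error is shown below. It assumes Python 3; the handler name "placeholder" and the function hebrew_placeholder are hypothetical names chosen for illustration. A handler receives the exception object and returns a replacement string together with the position at which encoding should resume.

```python
import codecs

def hebrew_placeholder(error):
    """Hypothetical handler: replace each unencodable character with '[?]'."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # error.start and error.end delimit the run of characters that failed.
    bad = error.object[error.start:error.end]
    return ("[?]" * len(bad), error.end)

# Register the handler under a name that encode() can refer to.
codecs.register_error("placeholder", hebrew_placeholder)

text = b'\xe0\xe1\xe2\xe3\xe4'.decode("cp1255")
result = text.encode("latin1", "placeholder")
```

With this handler, each of the five Hebrew letters is replaced by the three-character marker, so result contains five markers and no exception is raised.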
    