I am running the following code:
def displayFileOld(file_path):
    f = open(file_path, mode = 'rt', encoding = 'cp1252', errors = 'replace')
    while True:
        line = f.readline()
        if len(line) == 0:
            break
        print(line)
under Python 3.3, Windows 8 Pro.
The file that I am “parsing” (Java source file) is shown by Eclipse as being encoded in Cp1252 (“inherited from the main container”). Notepad++ says nothing more under the Encoding menu than “ANSI”. These two match.
First of all, I would expect the encoding to Unicode to…work. It fails, though, with the message:
Traceback (most recent call last):
  File "C:\work\test.py", line 69, in <module>
    main()
  File "C:\work\test.py", line 65, in main
    displayFileOld(r'C:\work\CVSProvisioningFeatures.java')
  File "C:\work\test.py", line 48, in displayFileOld
    print(line)
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 62-63: character maps to <undefined>
Second, I wouldn’t expect my stack trace to mention cp437.py instead of the *.py file corresponding to the encoding I specified in the flag. (The parsing fails when the “†” character is encountered; I am not sure how Unicode would not include this one. This is the context: new FeatureDescription(i++,"†† "+str));)
Third, I am not sure why the errors flag is ignored altogether.
I have spent a few hours trying the different encodings that are hosted under the generic “ANSI” umbrella, but in vain. All I can do is catch the exception and ignore the line (not acceptable). Another approach is to use some “exotic” encoding such as MacRoman, but that still leaves me with some unexpected characters (albeit I get only 12 errors instead of 431) after going through the whole source tree…characters that I will ultimately need to keep working with, passing tons of strings around. I have about 50 MB of Java sources to work on using a script, so any help getting this set up would be greatly appreciated.
Your problem is not with reading the file but with printing: the traceback shows that the line print(line) precedes the UnicodeEncodeError (note the Encode in that exception). When you read a file, you are decoding from cp1252 to Unicode (str) objects, and that is working just fine.

Your Windows terminal is using code page 437 and cannot handle the characters you are trying to print. Python needs to convert your data from Unicode to whatever encoding your terminal is using to be able to display the characters to you.
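To make the decode/encode distinction concrete, here is a minimal sketch that reproduces the failure without a file or a terminal. The byte string is an assumption built from the context line quoted in the question ("†" is byte 0x86 in cp1252). It also suggests why the errors flag seems to be ignored: the errors = 'replace' passed to open() only governs decoding while reading, whereas print() encodes through sys.stdout, which has its own (by default strict) error handling.

```python
# Assumed sample: the context line from the question, as cp1252 bytes.
# The dagger character (U+2020) is byte 0x86 in cp1252 and is absent from cp437.
raw = b'new FeatureDescription(i++,"\x86\x86 "+str));'

# Decoding (bytes -> str) is what open(..., encoding='cp1252') does; it succeeds.
line = raw.decode('cp1252', errors='replace')

# Encoding (str -> bytes) is what print() has to do for a cp437 terminal.
try:
    line.encode('cp437')
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    first_bad_index = exc.start  # index of the first character cp437 cannot map

print('decoded OK, encode to cp437 failed:', encode_failed)
```

The decode step never raises here, which matches the traceback: the exception comes from cp437.py, the codec of the terminal, not from the cp1252 codec used to read the file.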
You can switch your terminal code page with the chcp 65001 command (not a Python expression but a Windows command-line tool). Code page 65001 is the UTF-8 code page, which can handle all Unicode code points. You may need to switch fonts to be able to display these characters too. See Unicode characters in Windows command line - how?
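If changing the code page is not convenient, one possible workaround (my own assumption, not something the traceback requires) is to replace unencodable characters only at display time, leaving the strings you pass around untouched. The helper name to_printable is hypothetical; it relies only on the standard str.encode and bytes.decode methods.

```python
import sys

def to_printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'.

    Hypothetical helper: only the copy used for display is altered; the
    original string keeps its full Unicode content.
    """
    # Fall back to ASCII if stdout reports no encoding (e.g. when redirected).
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)

# Simulate a cp437 terminal explicitly so the result does not depend on the host.
sample = 'new FeatureDescription(i++,"\u2020\u2020 "+str));'
print(to_printable(sample, encoding='cp437'))
```

In the loop from the question this would be used as print(to_printable(line)), so a line containing "†" is shown with placeholders instead of raising UnicodeEncodeError.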