I am connecting to an MS SQL Server database through SQLAlchemy, using the pyodbc module. Everything appeared to be working fine until I began having problems with encodings: some of the non-ASCII characters are being replaced with ‘?’.
The DB has the collation ‘Latin1_General_CI_AS’ (I’ve also checked the specific fields, and they have the same collation). I started by selecting the encoding ‘latin1’ in the call to create_engine, and that appears to work for Western European characters (French or Spanish characters like é) but not for Eastern European characters. Specifically, I have a problem with the character ć.
I have tried selecting other encodings listed in the Python documentation, specifically the Microsoft ones such as cp1250 and cp1252, but I keep facing the same problem.
Does anyone know how to resolve these differences? Does the collation ‘Latin1_General_CI_AS’ have an equivalent among the Python encodings?
The code for my current connection is the following:
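    import pyodbc
    from sqlalchemy import create_engine

    def connect():
        # Raw pyodbc connection through the configured ODBC data source
        return pyodbc.connect('DSN=database;UID=uid;PWD=password')

    # SQLAlchemy calls connect() whenever it needs a new DBAPI connection
    engine = create_engine('mssql://', creator=connect, encoding='latin1')
    connection = engine.connect()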
Clarifications and comments:
- This problem happens when retrieving information from the DB. I don’t need to store anything.
- At the beginning I didn’t specify the encoding, and the result was that whenever a non-ASCII character was encountered in the DB, pyodbc raised a UnicodeDecodeError. I corrected that by using ‘latin1’ as the encoding, but that doesn’t solve the problem for all characters.
- I admit that the server is not on latin1; the comment is incorrect. I have checked both the database collation and the collations of the specific fields, and they all appear to be ‘Latin1_General_CI_AS’. How, then, can ć be stored? Maybe I’m not understanding collations correctly.
- I have corrected the question a little: I have tried more encodings than latin1, namely cp1250 and cp1252 (which is apparently the one used by ‘Latin1_General_CI_AS’, according to MSDN).
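For reference, the three encodings mentioned above can be compared directly in Python. This is only a sketch using the standard library, not part of the original connection code. It assumes the ‘?’ is produced by a lossy conversion somewhere before the data reaches Python (for example in the driver), which is a plausible explanation but not something confirmed here.

```python
# Illustrative only: compare the code pages mentioned above (standard library only).
CODECS = ["latin1", "cp1250", "cp1252"]

def encodable(char, codec):
    """Return True if `char` has a byte representation in `codec`."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Which of the three code pages can represent each character?
coverage = {ch: [c for c in CODECS if encodable(ch, c)] for ch in ("é", "ć")}
print(coverage)

# A lossy conversion substitutes '?' for anything the target code page lacks.
lossy = "ć".encode("cp1252", errors="replace")
print(lossy)

# Once the byte is '?', no choice of decoder on the Python side restores 'ć'.
print({c: lossy.decode(c) for c in CODECS})
```

If the substitution does happen upstream, this would explain why changing the encoding passed to create_engine makes no difference for ć: by the time Python sees the data, the original character is already gone.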
UPDATE:
OK, following these steps, I find that the encoding used by the DB appears to be cp1252: http://bytes.com/topic/sql-server/answers/142972-characters-encoding
Anyway, that appears to be a bad assumption, as reflected in the answers.
UPDATE2:
Anyway, after properly configuring the ODBC driver, I don’t need to specify the encoding in the Python code.
You should stop using code pages and switch to Unicode. This is the only way to get rid of this kind of problem.
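To illustrate the point, here is a small sketch using only the Python standard library. The sample string is made up for the example. It mixes a Western European character (ñ) with an Eastern European one (ć), so none of the single-byte code pages discussed in the question can hold both, while the Unicode encodings can. On the SQL Server side this would typically mean NCHAR/NVARCHAR columns, though the question does not say which column types are in use.

```python
# Illustrative only: a mixed-language string fits Unicode encodings,
# but none of the single-byte code pages tried in the question.
text = "ñ ć"  # hypothetical sample data

def round_trips(s, codec):
    """Return True if `s` survives encode + decode in `codec` unchanged."""
    try:
        return s.encode(codec).decode(codec) == s
    except UnicodeError:
        return False

results = {
    codec: round_trips(text, codec)
    for codec in ("latin1", "cp1252", "cp1250", "utf-8", "utf-16-le")
}
print(results)
```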