I am preparing a new database server, to which I will migrate data from a big, existing, multilingual database (mostly English/French/Spanish text, with occasional special characters from other languages, e.g. in city names). It will mostly be used by PHP applications developed by me and my colleagues.
I have difficulty understanding all the character set issues, and I would like to make the right choice from the start.
From what I read, in order to support all Unicode characters, I should use UTF-8.
My Questions:
- Which character set/collation should I set in Microsoft SQL Server 2008 to obtain UTF-8? Is Latin1_general_CS_AS the right choice?
- Should I use this meta tag in my HTML pages?
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=utf-8">
- Will there be characters that I won’t be able to support in my database, or that I will need to convert in some way?
Character set and collation are different things.
SQL Server 2008 has no support for UTF-8. You should store your data as Unicode, which means that the column types should be NCHAR and NVARCHAR. You can choose any collation you like, because whichever one you choose will be incorrect for some of your data. Collation determines the way values are sorted and compared, not the encoding they are stored with (drivers interpret collation information as an encoding hint for non-Unicode types, but that is a different topic). As you are mixing various languages, there is no single correct sort order (e.g. your application will suffer from the notorious Turkish I and Spanish ch sorting issues). However, this is generally not a big issue and users seldom notice it. Overall, a Latin collation will probably be best.
As for your returned HTTP charset: declare the charset you actually used to encode the page. What encoding SQL Server uses to store the data is completely irrelevant. A lot of developers run into issues here because they use non-Unicode data types in SQL Server (i.e. CHAR and VARCHAR), which results in many encoding incompatibilities in the returned HTTP data. Simply using Unicode column types will resolve most issues, as long as you don’t do anything stupid in your own application code (like trying to force an encoding).
BTW, since you mention that most applications will be PHP: with PHP it is likely you will need to convert the encoding from SQL Server's Unicode encoding (UCS-2) to your desired output format (UTF-8). Make sure you read Microsoft Drivers for PHP for SQL Server Unicode Support and Endianness, and use the ucs-2le encoding for SQL Server data.
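To make the conversion step concrete, here is a minimal sketch in Python (not PHP, and not touching SQL Server or any driver) of what re-encoding little-endian UCS-2 bytes as UTF-8 amounts to. The function name `ucs2le_to_utf8` and the sample string are made up for illustration, and the sketch assumes the bytes you receive really are little-endian two-byte code units.

```python
def ucs2le_to_utf8(raw: bytes) -> bytes:
    """Re-encode little-endian UCS-2 bytes as UTF-8 (hypothetical helper)."""
    # Python has no dedicated UCS-2 codec; "utf-16-le" decodes every valid
    # UCS-2LE sequence, so it is used here as a stand-in.
    text = raw.decode("utf-16-le")
    return text.encode("utf-8")


# Illustrative data only: city names with a few non-ASCII characters.
sample = "Málaga, Besançon, İstanbul"

# Assumption: this simulates the two-bytes-per-character form in which
# Unicode column data would arrive from the database.
stored = sample.encode("utf-16-le")

# Bytes suitable for a page declared as charset=utf-8.
page = ucs2le_to_utf8(stored)

print(page.decode("utf-8"))
```

The point of the sketch is only that the stored encoding and the output encoding are independent: the same text round-trips through both, and the charset you declare in HTTP should match the second one.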