I have a multi-language website that communicates with a database, which contains language-specific translations.
For example, a table gender has 10 rows, and each row indicates a language.
+---------+-----------+-----+
| English | French    | etc |
| Male    | Masculine | ... |
+---------+-----------+-----+
Some languages (like Chinese, Greek, Turkish, Spanish, Russian, etc.) have characters outside of latin1, and when I read the data from the database on my site they come out as ? and garbled symbols (mojibake).
So, how do I fix this?
I know I need to use a certain collation on the db and add the specific meta charset tag, but it’s still not working.
cp1256 | Windows Arabic | cp1256_general_ci (it's not giving me the correct Arabic solution.)
gbk | GBK Simplified Chinese | gbk_chinese_ci (it's not giving me the correct Chinese solution.)
There are a whole load of areas of your system that need to be considered when looking at multi-lingual systems.
You need to ensure that you are using a suitable character encoding throughout your system. In most cases, the best choice of character encoding is UTF-8. (UTF-8 can represent every Unicode character, so there is no need to reach for UTF-16, and PHP will struggle with UTF-16 anyway. In general, stick with UTF-8 for everything and you’ll be fine.)
You need to make sure you’re using the same character encoding in the following places: the database, the web server, and your PHP source code.
The database is easy to deal with: just make sure all tables are created with UTF-8 encoding for their charset. Job done.
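As a rough sketch of the database side, assuming MySQL accessed through PDO (the charset names in the question suggest MySQL, but that is an assumption): the helper names, host, database and column names below are all made up for illustration. In MySQL the charset called utf8 only stores up to three bytes per character, so utf8mb4 is the one that covers all of UTF-8. The charset=utf8mb4 part of the DSN also asks for UTF-8 on the connection itself, which is a common place for this kind of garbling to creep in.

```php
<?php
// Assumption: MySQL via PDO. Host, database, table and column names are hypothetical.

// Build a DSN that requests UTF-8 (utf8mb4) for the connection itself.
function buildDsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// SQL for a translation table whose text columns are stored as utf8mb4.
function createTableSql(string $table): string
{
    return sprintf(
        'CREATE TABLE IF NOT EXISTS `%s` ('
        . ' id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,'
        . ' english VARCHAR(100) NOT NULL,'
        . ' french VARCHAR(100) NOT NULL'
        . ') DEFAULT CHARACTER SET utf8mb4',
        $table
    );
}

// Connect and create the table. Not called here because it needs a live server.
function setUpGenderTable(string $host, string $dbName, string $user, string $password): PDO
{
    $pdo = new PDO(buildDsn($host, $dbName), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
    $pdo->exec(createTableSql('gender'));
    return $pdo;
}
```

Existing tables created with another charset would need converting rather than recreating; that step is not shown here.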
Collation is less relevant: it specifies the sort order. This does matter, of course, but it has no bearing on the garbled text display you’re seeing. (It’s worth saying that some characters are sorted differently in different languages, so it’s virtually impossible to pick a collation that will suit everyone if you need to support multiple languages in a single table, but I wouldn’t get too worried about this for now.)
The web server is relatively simple too, as long as you’re comfortable with Apache config (or whatever server software you’re using). You need to ensure that all pages output to the browser are sent using UTF-8 encoding.
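If you would rather not touch the server config, one way to do the same thing from PHP is to send the Content-Type header yourself. This is only a sketch: the function names are made up, and text/html is an assumption about what your pages return. On Apache, the AddDefaultCharset UTF-8 directive is the config-level equivalent.

```php
<?php
// Build the Content-Type header line; text/html is an assumed default page type.
function contentTypeHeader(string $mimeType = 'text/html'): string
{
    return 'Content-Type: ' . $mimeType . '; charset=utf-8';
}

// Send the header before any output. Returns false if output has already
// started, because headers can no longer be changed at that point.
function sendUtf8Header(): bool
{
    if (headers_sent()) {
        return false;
    }
    header(contentTypeHeader());
    return true;
}
```

Call sendUtf8Header() at the very top of each page (or in a shared bootstrap file), before any HTML is echoed.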
Finally, your PHP source code…
Firstly, you should make sure you’re editing the actual PHP code files in UTF-8 mode. Otherwise, you may have trouble if you have any extended characters written in your code.
Secondly, be aware that a number of PHP’s standard string handling functions are “not multi-byte aware”. This means that they don’t work correctly with extended character sets. For example, strlen() will return the number of bytes the string takes up in memory. This will be incorrect if your string includes characters that take up more than one byte. Fortunately, PHP also supplies a set of multi-byte functions to resolve this. So for example, instead of using strlen(), use mb_strlen(). The PHP manual gives more detail about the exact functions available and when to use them.
Also, make sure that you handle any incoming posted data with the correct character set as well.
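To make the byte-versus-character difference concrete, here is a small sketch. It assumes the mbstring extension is loaded and that the file itself is saved as UTF-8; characterCount() and isValidUtf8() are made-up helper names, not built-in functions.

```php
<?php
// Assumes the mbstring extension is available and this file is saved as UTF-8.

// Count characters rather than bytes, naming the encoding explicitly.
function characterCount(string $text): int
{
    return mb_strlen($text, 'UTF-8');
}

// Check that a string (for example a posted form field) is well-formed UTF-8.
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

$word = "Привет";               // six Cyrillic letters, two bytes each in UTF-8

$bytes = strlen($word);         // 12: strlen() counts bytes
$chars = characterCount($word); // 6: mb_strlen() counts characters

$latin1 = "caf\xE9";            // "café" encoded as latin1, not UTF-8
$ok = isValidUtf8($latin1);     // false: a lone 0xE9 byte is not valid UTF-8
```

For posted data, something like isValidUtf8($_POST['name'] ?? '') before storing the value would catch input that arrived in a different encoding; what you do with rejected input is up to you.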
Hopefully that will help you. The key here is to ensure that your system uses a consistent character set throughout all its layers. Problems with weird-looking encoding errors tend to happen when one layer in your system is using a different character set to the others. Make sure they’re all the same (and preferably UTF-8), and that should resolve your garbled character problems.