I have a Rails app that receives data from an Android device. I noticed that some of the data, when it is in Japanese, is not saved correctly. It shows up as literal question marks (not the diamond ones) in the MySQL client and on the Rails website.
It turns out that the database connected to the Rails app is set to Latin1, while Rails is set to UTF-8.
I have read a lot about character encodings, but everything I found describes data that is still somewhat readable. Mine, however, is only literal question marks. Trying to convert the data to UTF-8 using several methods found on the web doesn’t change a thing either. I suspect that the data is converted to question marks when it’s written to the database.
Sample output from the MySQL console:
select * from foo where bar = "foobar";
+-------+------+------------------------+---------------------+---------------------+
| id    | name | bar                    | created_at          | updated_at          |
+-------+------+------------------------+---------------------+---------------------+
| 24300 | ???? | foobar                 | 2012-01-23 05:04:22 | 2012-01-23 05:04:22 |
+-------+------+------------------------+---------------------+---------------------+
1 row in set (0.00 sec)
The input data that my Rails app received from the Android client was:
name = 爆笑笑話
I have verified that this input data is intact in the Rails app before it is saved to the database, so it is not mangled in the Android client or during transfer to the server. Is there any chance I can get this data back? Or is it completely lost?
It’s actually very easy to think that data is encoded one way when it is actually encoded another way. Any attempt to retrieve the data directly converts it first to the character set of your database connection and then to the character set of your output medium. You should therefore first verify the actual encoding of your stored data with either

SELECT BINARY name FROM foo WHERE bar = 'foobar';

or

SELECT HEX(name) FROM foo WHERE bar = 'foobar';

Where the character 爆 is expected, you will likely find one of the following byte sequences:

0xe78886, indicating that your column actually contains UTF-8 encoded data. This usually happens when the character set of the database connection over which the text was originally inserted was set to latin1 but UTF-8 encoded data was actually sent.

You must be seeing ? characters when fetching the data because something between the data storage and the display has been unable to transcode those bytes. (Given that MySQL thinks they represent the latin1 characters çˆ†, which are likely available in most character sets, it’s unlikely that this is occurring within MySQL itself, unless you’re explicitly adjusting the encoding information during retrieval.)

Anyway, if this is the case, you need to drop the encoding information from the column and then tell MySQL that the data is actually encoded as UTF-8. As documented under ALTER TABLE Syntax, this is done by first changing the column to a binary type (such as BLOB) and then changing it to a nonbinary type with the desired character set.

0x3f, indicating that the database does actually contain the literal character ? and your original data has been lost. This doesn’t happen easily, since MySQL usually throws error 1366 if implicit transcoding results in loss of data. Perhaps there was some explicit transcoding in your insert statement?

In this case, you need to convert the storage encoding to a suitable format, then update or re-insert the data.
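
To make the two outcomes above concrete, here is a minimal Ruby sketch that reproduces them at the byte level, outside the database. It does not talk to MySQL at all. The helper names (`utf8_hex`, `misread_as_latin1`, `lossy_latin1_hex`, `classify_hex`) are hypothetical and exist only for this illustration. It also assumes that Windows-1252 is a close enough stand-in for MySQL's latin1.

```ruby
# Byte-level illustration of the two HEX(name) outcomes described above.
# All helper names are hypothetical; nothing here connects to MySQL.

# Hex of the UTF-8 bytes, i.e. what HEX(name) would show if UTF-8 bytes
# had been stored unchanged in the latin1 column.
def utf8_hex(text)
  text.encode("UTF-8").unpack1("H*")
end

# The same bytes reinterpreted as latin1 text. Assumption: Windows-1252
# is used as a stand-in for MySQL's latin1.
def misread_as_latin1(text)
  text.encode("UTF-8").b.force_encoding("Windows-1252").encode("UTF-8")
end

# Hex after a lossy conversion to latin1: every character that latin1
# cannot represent is replaced by "?" (byte 0x3f).
def lossy_latin1_hex(text)
  text.encode("ISO-8859-1", undef: :replace, replace: "?").unpack1("H*")
end

# Compare a HEX(name) result with the text that was expected in the row.
def classify_hex(hex, expected)
  stored = [hex].pack("H*") # raw bytes; accepts upper- or lower-case hex
  if stored == expected.encode("UTF-8").b
    :utf8_bytes_in_latin1_column   # first case: data is recoverable
  elsif stored == ("?" * expected.length).b
    :replaced_by_question_marks    # second case: original data is gone
  else
    :unknown
  end
end

name = "爆笑笑話"
puts utf8_hex("爆")               # e78886
puts misread_as_latin1("爆")      # çˆ†
puts lossy_latin1_hex(name)       # 3f3f3f3f
p classify_hex("3F3F3F3F", name)  # :replaced_by_question_marks
```

The two cases also differ in length. A four-character name stored as mislabelled UTF-8 occupies twelve bytes, whereas a lossy conversion leaves exactly four 0x3f bytes, so the length of the HEX output is a quick first hint before comparing individual bytes.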