I’m looking for a best-practices approach to adding support for Asian character sets to an existing database. We have existing tables that are in the latin1 charset:
show create table books
CREATE TABLE `books` (
`id` varchar(255) NOT NULL,
`category` varchar(255) default NULL,
`contactEmail` varchar(255) default NULL,
`description` text,
`price` varchar(255) default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Currently when we enter UTF8 chars for the description field, we get back ‘?’ chars for Asian chars on the round-trip. Latin1 chars work just fine.
Can I simply convert this table with something like this?
ALTER TABLE books CONVERT TO CHARACTER SET utf8
I understand that this won’t magically fix data already present in the table. I just want it to work properly for new data going forward.
Do I need to worry about collation? I have no idea how that would work for non-latin characters.
Would it make sense to make utf8 the default for the database? Are there any caveats to that?
Thanks
I don’t have much experience with how MySQL handles character sets, but I have experience with character sets in general.
Because your table uses latin1 for encoding, it can only store characters that are present in the latin1 character set. Latin1 is shorthand for ISO-8859-1 (MySQL's own latin1 is closer to Windows-1252), and you can look up which characters it has: no Asian characters, which is why they won’t store. I’m a little surprised MySQL doesn’t error on such input.
UTF-8 would be a good choice if you need to store characters from multiple languages. UTF-8, as a Unicode encoding, will let you store any Unicode character (there are well over a hundred thousand of them), from many languages. You could store the string “Dog café θλφ 你好” using UTF-8. UTF-8 is widely used and can encode just about anything; I highly recommend it. One MySQL-specific caveat: its utf8 charset stores at most three bytes per character, so characters outside the Basic Multilingual Plane need utf8mb4 instead.
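To make the difference concrete, here is a small sketch in Python rather than MySQL. It assumes Python's "latin-1" codec (strict ISO-8859-1) is a fair stand-in for the table's charset, and that the '?' characters you see come from a lossy conversion of this kind; the variable names are only for illustration.

```python
# Sample string taken from the paragraph above.
sample = "Dog café θλφ 你好"

# Assumption: Python's "latin-1" codec stands in for the table's latin1 charset.
# errors="replace" substitutes '?' for every character latin1 cannot represent.
via_latin1 = sample.encode("latin-1", errors="replace").decode("latin-1")

# UTF-8 can represent every character in the string, so the round trip is lossless.
via_utf8 = sample.encode("utf-8").decode("utf-8")

print(via_latin1)  # Dog café ??? ??
print(via_utf8)    # Dog café θλφ 你好
```

The é survives the latin1 round trip because it is part of latin1; the Greek and Chinese characters do not, and once they have been replaced by '?' the original text cannot be recovered.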
I would peruse the Internet to find literature on converting MySQL tables, to make sure there aren’t any gotchas. If this is production data, test on an offline dataset — a development table or a QA table.
Last, you seem to indicate that there are half-stored Asian characters somehow in your DB. I’d figure out what exactly is stored: if it’s the UTF-8 byte sequence for the Asian character, but the database thinks it’s latin1 (a classic case of mojibake), some recovery may be possible. I would worry that the conversion may attempt to transform the UTF-8 code units as if they were latin1, resulting in very interesting output. Test, test, test.
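A rough way to check which case you are in, sketched in Python on a value pulled out of the database. The helper name try_recover_mojibake is made up for this example, and Python's "latin-1" codec is again assumed to match how the bytes were misread; treat it as a diagnostic, not a migration tool.

```python
from typing import Optional


def try_recover_mojibake(stored: str) -> Optional[str]:
    """Return the repaired text if `stored` looks like UTF-8 misread as latin1, else None."""
    try:
        # Undo the wrong decode: get the raw bytes back, then read them as UTF-8.
        repaired = stored.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin1, or not valid UTF-8: nothing to repair.
        return None
    # Pure ASCII (including literal '?') comes back unchanged, so there is nothing to recover.
    return repaired if repaired != stored else None


# Simulate the mojibake case: UTF-8 bytes for "你好" interpreted as latin1.
mojibake = "你好".encode("utf-8").decode("latin-1")

print(try_recover_mojibake(mojibake))  # 你好
print(try_recover_mojibake("??"))      # None
print(try_recover_mojibake("café"))    # None
```

If the stored value really is a run of literal '?' characters, as in the second call, the original text was already lost on the way in and no conversion will bring it back.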