I have a table containing a dictionary of words in my language (Latvian):
CREATE TABLE words (
value varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Let's say it contains three words:
INSERT INTO words (value) VALUES ('tēja');
INSERT INTO words (value) VALUES ('vējš');
INSERT INTO words (value) VALUES ('feja');
I want to find all words that are exactly 4 characters long, where the second character is 'ē' and the third character is 'j'.
It seems to me that the correct query would be:
SELECT * FROM words WHERE value LIKE '_ēj_';
The problem is that this query returns not the 2 expected entries ('tēja', 'vējš') but all three.
As I understand it, this is because MySQL internally treats 'ē' and 'e' as the same letter when comparing strings, as if it converted them to some ASCII representation?
There is also the BINARY modifier for LIKE:
SELECT * FROM words WHERE value LIKE BINARY '_ēj_';
But this does not return the 2 entries ('tēja', 'vējš') either, only one ('tēja'). I believe this is because UTF-8 uses 2 bytes for these non-ASCII characters, so in a binary comparison '_' matches a single byte rather than a single character?
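One way to check that guess is to compare character counts with byte counts. A small sketch using MySQL's CHAR_LENGTH and LENGTH functions, assuming the three rows above were stored as UTF-8:

```sql
-- CHAR_LENGTH counts characters, LENGTH counts bytes.
-- A row where the two differ contains multi-byte characters.
SELECT value,
       CHAR_LENGTH(value) AS chars,
       LENGTH(value)      AS bytes
FROM words;
```

If the guess is right, all three words should report 4 characters, while 'tēja' and 'vējš' report more than 4 bytes. In 'vējš' the last character is itself 2 bytes long, which would explain why the final '_' in the binary pattern fails to match it.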
So my question is:
What MySQL query would return exactly those two words ('tēja', 'vējš')?
Thank you in advance
The utf8_bin collation is not just diacritical-sensitive, but also case-sensitive. If you want to match only the letter-with-diacritical and you don't care about upper/lower case, you would have to find a utf_..._ci collation that doesn't treat e and ē as the same letter.
I can't immediately see one (there are plenty that don't collate ē at all, which would be okay if you only need case-sensitive matching on the non-diacritical letters). Interesting that the Latvian collation treats macron-letters as the same as plain letters, which you don't want (it knows š is different from s).
Anyway, whatever collation you end up with, you will want to put your tables in that collation rather than manually specifying it in a query, so that comparisons can be properly indexed.
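As a rough sketch of both options with utf8_bin (the per-query override and the column-level change), assuming the words table from the question and a connection character set of utf8:

```sql
-- Option 1: override the collation for a single query.
-- utf8_bin compares code points, so 'e' and 'ē' are different,
-- but '_' still matches one character rather than one byte.
SELECT * FROM words WHERE value LIKE '_ēj_' COLLATE utf8_bin;

-- Option 2: store the column in utf8_bin so that every comparison
-- on it uses that collation without a per-query COLLATE clause.
ALTER TABLE words
  MODIFY value varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;

-- After the ALTER, the original query is diacritical-sensitive as written.
SELECT * FROM words WHERE value LIKE '_ēj_';
```

With the three rows from the question, both SELECT statements should return 'tēja' and 'vējš' but not 'feja'. The trade-off is the one described above: utf8_bin is also case-sensitive, so an uppercase 'Ē' would no longer match a lowercase 'ē'.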