I’ve got a database with UTF-8 characters in it, which are improperly displayed. I figured that I could use UNHEX(HEX(column)) != column condition to know what fields have UTF-8 characters in them. The results are rather interesting:
id | content | HEX(content) | UNHEX(HEX(content)) LIKE '%c299%' | UNHEX(HEX(content)) LIKE '%FFF%' | UNHEX(HEX(content))
49829102 | | C299 | 0 | 0 | c299
874625485 | FFF | 464646 | 0 | 1 | FFF
How is this possible and, possibly, how can I find the row with this character in it?
— edit(2): since my edit has been removed (probably when JamWaffles was fixing my beautiful data table), here it is again: as editor strips out UTF-8 characters, the content in first row is \uc299 (if that’s not clear 😉 )
— edit(3): I’ve figured out what the issue is – the actual representation of UNHEX(HEX(content)) is WRONG – to display my multibyte character I had to do the following: SELECT UNHEX(SUBSTR(HEX(content),1))). Sadly UNHEX(C299) doesn’t work as UNHEX(C2)+UNHEX(99) so it’s back to the drawing board.
There are two ways to determine if a string contains UTF-8 specific characters. The first is to see if the string has values outside the ASCII character set:
The second is to compare the binary and character lengths:
Both return
TRUE.See http://sqlfiddle.com/#!2/d41d8/9811