GB2312 is a double-byte character set. Using mb_strlen() to check a single Chinese character returns 2, but for two or more characters the result is sometimes weird. Does anybody know why? How can I get the right length?
<?php
header('Content-Type: text/html; charset=utf-8'); // the page is served as UTF-8
$a="大";
echo mb_strlen($a,'gb2312'); // outputs 2
echo mb_strlen($a.$a,'gb2312'); // outputs 3, but I expected 4
echo mb_strlen($a.'a','gb2312'); // outputs 2, but I expected 3
echo mb_strlen('a'.$a,'gb2312'); // outputs 3
?>
Thanks deceze, your document is very helpful. People who know as little about encodings as I do should read it: "What every programmer absolutely, positively needs to know about encodings and character sets to work with text".
Your string is probably stored as UTF-8.
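That guess can be checked by hand. The sketch below writes the UTF-8 bytes of "大" out explicitly and groups them the way a simple GB2312 scanner might: a byte in the range 0xA1 to 0xFE is treated as a lead byte that swallows the next byte, and anything else stands alone. The helper gb2312_units() is hypothetical, made up for this illustration; it is an assumption about how the counting behaves, not mbstring's actual implementation.

```php
<?php
// Hypothetical helper (not part of mbstring): group raw bytes the way a
// naive GB2312/EUC-CN scanner would, without validating trail bytes.
function gb2312_units(string $bytes): array
{
    $units = [];
    $n = strlen($bytes);
    for ($i = 0; $i < $n; $i += $width) {
        $b = ord($bytes[$i]);
        // Assumption: 0xA1..0xFE is a lead byte of a two-byte character.
        $width = ($b >= 0xA1 && $b <= 0xFE) ? 2 : 1;
        $units[] = strtoupper(bin2hex(substr($bytes, $i, $width)));
    }
    return $units;
}

$a = "\xE5\xA4\xA7"; // the UTF-8 bytes of "大", spelled out

foreach ([$a, $a . $a, $a . 'a', 'a' . $a] as $s) {
    $units = gb2312_units($s);
    echo implode(' | ', $units), ' => ', count($units), "\n";
}
// E5A4 | A7 => 2
// E5A4 | A7E5 | A4A7 => 3
// E5A4 | A761 => 2
// 61 | E5A4 | A7 => 3
```

The counts 2, 3, 2, 3 match the four results in the question, which is why the UTF-8 explanation looks plausible.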
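The UTF-8 code for "大" is E5 A4 A7 (according to this webpage), so the string is three bytes long, not two. Read as GB2312, E5 A4 pairs up into one double-byte character and the leftover A7 is counted as a second one, which gives 2. The other results follow the same pairing: $a.$a is E5 A4, A7 E5, A4 A7 (3 units), $a.'a' is E5 A4, A7 61 (2 units), and 'a'.$a is 61, E5 A4, A7 (3 units). This is just a guess, but it makes perfect sense to me when thinking about it this way. You can probably refer to this Wikipedia page.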
If you really want to test this, I recommend creating a separate file saved in GB2312 encoding and reading it with fopen or whatever. Then you can be sure the data is in the desired encoding.
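Here is a minimal sketch of that test, assuming the mbstring extension is loaded. Instead of saving a file by hand in an editor, it converts the UTF-8 bytes to GB2312 with mb_convert_encoding(), writes them to a temporary file, and reads them back. The temp file name and the sample text are made up for the example.

```php
<?php
$utf8 = "\xE5\xA4\xA7";                               // "大" as UTF-8 (3 bytes)
$gb   = mb_convert_encoding($utf8, 'GB2312', 'UTF-8'); // same character, 2 bytes

// Hypothetical scratch file, used only for this test.
$path = tempnam(sys_get_temp_dir(), 'gb');
file_put_contents($path, $gb . $gb . 'a');

$data = file_get_contents($path);
unlink($path);

echo strlen($data), "\n";              // byte length: 5
echo mb_strlen($data, 'GB2312'), "\n"; // character count: 3
echo mb_strlen($utf8 . $utf8 . 'a', 'UTF-8'), "\n"; // also 3
```

Note that mb_strlen() counts characters, not bytes, so even for genuine GB2312 data a single Chinese character gives 1; strlen() is the one that returns the byte length. Passing the encoding the string is actually stored in (UTF-8 for the script in the question) gives the character count directly.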