I am using the PHP PDO_Informix driver v1.2.7 and the Informix client version is 3.70. I have some code in UTF-8 that makes queries to a Latin1 database (the Informix server is 9.21).
The problem is that the driver truncates some of the returned string values. It is as if special characters count double.
If a column ‘name’ has type VARCHAR(2) and holds the value ‘áa’, the query returns ‘á’ instead of ‘áa’. If I resize the column to VARCHAR(3), the result is correct.
Below I attach a short script to reproduce the bug. I included the DSN so you can
see the encoding settings.
Test script:
$dsn = "informix:database=base;server=ol_server;host=192.168.123.123;client_locale=en_us.utf8;db_locale=en_us.819;service=1526;protocol=olsoctcp;EnableScrollableCursors=1";
$db = new \PDO($dsn, 'user', 'pass');
$db->exec("CREATE TABLE ticket82 ( name VARCHAR(2) );");
$db->exec("INSERT INTO ticket82 VALUES ('aa');");
$statement = $db->query("select name from ticket82;");
$value = $statement->fetchAll(\PDO::FETCH_ASSOC);
echo "expected 'aa' got '{$value[0]['NAME']}'\n";
$db->exec("update ticket82 set name='áa';");
$statement = $db->query("select name from ticket82;");
$value = $statement->fetchAll(\PDO::FETCH_ASSOC);
echo "expected 'áa' got '{$value[0]['NAME']}'\n";
$db->exec("ALTER TABLE ticket82 MODIFY (name varchar(3));");
$statement = $db->query("select name from ticket82;");
$value = $statement->fetchAll(\PDO::FETCH_ASSOC);
echo "expected 'áa' got '{$value[0]['NAME']}'\n";
$db->exec("DROP TABLE ticket82;");
Expected result:
expected 'aa' got 'aa'
expected 'áa' got 'áa'
expected 'áa' got 'áa'
Actual result:
expected 'aa' got 'aa'
expected 'áa' got 'á'
expected 'áa' got 'áa'
Any ideas?
In a slightly weird way, I think that is the ‘expected’ or ‘working as designed’ behaviour.
The column size is specified in bytes rather than characters, but for the database code set (ISO 8859-1, aka Latin-1) there is no difference, because every character occupies exactly one byte. The client-side code (PDO Informix) assumes that the variable holding the value should allow the same number of bytes of storage.
However, the client-side code set is UTF-8 rather than 8859-1, and some 8859-1 characters require 2 bytes in UTF-8. To be precise, characters in the ‘ASCII’ range U+0000..U+007F require 1 byte in UTF-8, but those in the ‘accented’ range U+0080..U+00FF require 2 bytes. Because the client side has limited its variables to 2 bytes (rather than 2 characters), you will only be able to select a single accented character from a VARCHAR(2) column. In your example, ‘áa’ needs 3 bytes in UTF-8 (2 for ‘á’ and 1 for ‘a’), so the trailing ‘a’ is cut off.
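To make the byte counts concrete, here is a minimal sketch in plain PHP. It only simulates the effect of a 2-byte buffer with substr(); it does not call PDO Informix or the CSDK, and the variable names are purely illustrative. The strings are written with escape sequences so the result does not depend on the encoding of the script file.

```php
<?php
// 'áa' as stored by the Latin-1 database: bytes 0xE1 0x61 (2 bytes).
$latin1 = "\xE1a";

// The same text after conversion to UTF-8: bytes 0xC3 0xA1 0x61 (3 bytes).
$utf8 = "\u{00E1}a";

// strlen() counts bytes, not characters.
echo strlen($latin1), "\n"; // 2
echo strlen($utf8), "\n";   // 3

// Simulate a client-side buffer sized from the column's byte length (2).
$truncated = substr($utf8, 0, 2);
echo bin2hex($truncated), "\n"; // c3a1, which is 'á' on its own
```

The truncated value is exactly the ‘á’ seen in the second line of the actual result.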
The codeset conversion between UTF-8 and 8859-1 occurs in a library called GLS (Global Language Support) inside the Informix ClientSDK (CSDK) code that is used by PDO Informix.
This is an interesting setup with the client and database server using different code sets. There’s room to think that the client could usefully use bigger variable sizes when there is a code set conversion going on. Since the database is storing Latin-1, all the characters fall in the Unicode range U+0000..U+00FF. (If it were ISO 8859-15, also known as Latin-9, the Euro symbol € U+20AC would require 3 bytes in UTF-8, for instance; most of the other 8859-x series code sets require one or two bytes per character, I believe.) Handling that sensibly in the codeset conversion environment would require some care, but could be done if the code were aware of the issue. The fix probably belongs in PDO Informix, since it is PDO Informix that tells the CSDK how much space to use for storing the data, using the byte-count information provided by CSDK and the Informix server.
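As a rough illustration of what ‘bigger variable sizes’ could mean, the sketch below computes a worst-case UTF-8 buffer size from the column's byte length, plus the exact size for a given Latin-1 value. Both function names are hypothetical; this is not how PDO Informix or the CSDK is actually implemented, just the arithmetic a fix would need.

```php
<?php
/**
 * Hypothetical helper: worst-case UTF-8 size for a column of $columnBytes
 * single-byte characters, where $maxBytesPerChar is the widest UTF-8
 * encoding of any character in the database code set
 * (2 for ISO 8859-1, 3 for ISO 8859-15 because of the Euro sign).
 */
function worstCaseUtf8Bytes(int $columnBytes, int $maxBytesPerChar): int
{
    return $columnBytes * $maxBytesPerChar;
}

/** Hypothetical helper: exact UTF-8 byte count for an ISO 8859-1 string. */
function utf8LengthOfLatin1(string $latin1): int
{
    $bytes = 0;
    for ($i = 0, $n = strlen($latin1); $i < $n; $i++) {
        // Bytes 0x00..0x7F map to U+0000..U+007F (1 byte in UTF-8);
        // bytes 0x80..0xFF map to U+0080..U+00FF (2 bytes in UTF-8).
        $bytes += ord($latin1[$i]) < 0x80 ? 1 : 2;
    }
    return $bytes;
}

$columnBytes = 2; // VARCHAR(2)
echo worstCaseUtf8Bytes($columnBytes, 2), "\n"; // 4
echo utf8LengthOfLatin1("\xE1a"), "\n";         // 3
```

With a 4-byte buffer for the VARCHAR(2) column, the 3-byte UTF-8 form of ‘áa’ would fit without truncation.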
FYI: Informix 9.21 has been out of support for a long time now (so have 9.30, 9.40 and 10.00; even 11.10 is out of support, though that is a relatively recent change). However, that is not a factor in this problem.