I’m developing a PHP script that parses data from XLS files, using the phpexcelreader library. It mostly works, but I’ve stumbled upon a strange problem: some files are parsed incorrectly. It looks like XLS files may use different character encodings internally. At least, when I pipe the output of my script through iconv -f cp1251 -t utf8, the strings come out correct.
Phpexcelreader has an option for specifying the output encoding, but it seems to lack the ability to detect the input encoding. Any ideas?
The _defaultEncoding property of the workbook object can be set to contain the charset used by the Excel file, and this is then used to handle conversion to UTF-16LE by the reader, but it makes no effort to identify the internal charset itself.
You can add that detection yourself. Define a SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE constant among the other SPREADSHEET_EXCEL_READER_TYPE definitions, and then modify the switch statement starting at line 464 to include a case for SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE. The logic for this case needs to be something like:
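As a rough sketch of that step: in the BIFF format the CODEPAGE record has identifier 0x0042, and its payload is a 2-byte little-endian codepage number that follows the 4-byte record header. The function name `readCodepageNumber` below is hypothetical and only stands in for the body of the new case so that the snippet runs on its own. Inside the reader the same logic would work on the reader's own data buffer and position, and the result would go through the name lookup before being stored.

```php
<?php
// BIFF record identifier for CODEPAGE; belongs next to the other
// SPREADSHEET_EXCEL_READER_TYPE_* definitions.
define('SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE', 0x0042);

// Hypothetical standalone stand-in for the new case body.
// $data is the raw workbook stream, $pos the offset of the current record.
function readCodepageNumber($data, $pos)
{
    // Record header: 2-byte record type, then 2-byte length (little-endian).
    $type = ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
    if ($type !== SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE) {
        return null; // not a CODEPAGE record
    }
    // The payload starts after the 4-byte header: a 2-byte codepage number.
    return ord($data[$pos + 4]) | (ord($data[$pos + 5]) << 8);
}

// Example record: type 0x0042, length 2, codepage 1251 (0x04E3).
$record = "\x42\x00\x02\x00\xE3\x04";
var_dump(readCodepageNumber($record, 0)); // int(1251)
```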
Recreate the _GetInt2d method (that seems to have been stripped from the code at some point) as
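A minimal sketch of what that method probably looked like, assuming it reads an unsigned 2-byte little-endian integer at a given offset. It is written as a plain function here so it runs on its own; in the reader it would be a method of the class.

```php
<?php
// Read an unsigned 16-bit little-endian integer from $data at offset $pos.
function _GetInt2d($data, $pos)
{
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}
```

For example, `_GetInt2d("\xE3\x04", 0)` gives 1251, the codepage number for Windows Cyrillic.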
and create a _CodePageNumberToName method to return the codepage name from its numeric value:
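One possible version is sketched below. The table is deliberately partial and covers only some common codepages, and the fallback to a generic "CP" plus number name is my own assumption rather than anything the library defines. The returned names are meant to be ones that iconv accepts.

```php
<?php
// Map an Excel codepage number to an encoding name usable with iconv.
// Only a handful of common codepages are listed; extend as needed.
function _CodePageNumberToName($codePage)
{
    $names = array(
        367   => 'ASCII',
        437   => 'CP437',
        850   => 'CP850',
        866   => 'CP866',
        874   => 'CP874',
        932   => 'CP932',
        936   => 'CP936',
        949   => 'CP949',
        950   => 'CP950',
        1200  => 'UTF-16LE',
        1250  => 'CP1250',
        1251  => 'CP1251',
        1252  => 'CP1252',
        1253  => 'CP1253',
        1254  => 'CP1254',
        1255  => 'CP1255',
        1256  => 'CP1256',
        1257  => 'CP1257',
        1258  => 'CP1258',
        65001 => 'UTF-8',
    );

    if (isset($names[$codePage])) {
        return $names[$codePage];
    }

    // Assumption: unknown numbers fall back to the generic "CP<number>" spelling,
    // which iconv may or may not recognise.
    return 'CP' . $codePage;
}
```

With the file from the question, the record would hold 1251, so the lookup returns 'CP1251', the same encoding that fixed the output when passed to iconv by hand.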
Then store the returned value in $_defaultEncoding.
Alternatively, switch to an Excel reader that can handle the codepage correctly in the first place.