How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX C interfaces?
Here’s an example. Let’s put unicode characters in a UTF-8 encoded file test.txt, then read it as
fid=fopen('test.txt','r','l','UTF-8');
s=fscanf(fid, '%s')
in MATLAB.
Now if I first call feature('DefaultCharacterSet', 'UTF-8') and then, from C, engEvalString(ep, "s"), the output I get back is the text from the file as UTF-8. This shows that MATLAB stores it as Unicode internally. However, if I do mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters replaced by character code 26.
What I need is access to the underlying Unicode data as a C string in any Unicode format (UTF-8, UTF-16, etc.), with the non-Latin-1 characters preserved. Is this possible?
My platform is OS X, MATLAB R2012b.
Addendum: The documentation explicitly states that “[mxArrayToString()] supports multibyte encoded characters”, yet it still gives me only a Latin-1 approximation to the original data.
First, let me share a few references I found online:
According to the mxChar description:
Still, the term MBCS is somewhat ambiguous to me. I think they meant UTF-16 in this context (although I’m not sure whether surrogate pairs are supported; if they are not, it is effectively UCS-2 instead).
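To make the UTF-16 versus UCS-2 distinction concrete, here is a minimal sketch in plain C that converts a buffer of 16-bit code units to UTF-8. It assumes that mxChar is a 16-bit UTF-16 code unit, so uint16_t stands in for it here; in a MEX file the buffer would come from mxGetChars and its length from mxGetNumberOfElements. The function name utf16_to_utf8 is hypothetical, not part of any MATLAB API. The surrogate-pair branch is the step a pure UCS-2 reader would skip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n UTF-16 code units to a NUL-terminated
 * UTF-8 string. uint16_t stands in for mxChar (assumed to be 16 bits wide).
 * Returns the number of bytes written, not counting the NUL, or (size_t)-1
 * if `out` (capacity `cap` bytes) is too small.
 * Unpaired surrogates are replaced by U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, char *out, size_t cap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;

    while (i < n) {
        uint32_t cp = in[i++];
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF && i < n &&
            in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            /* High + low surrogate: one code point above U+FFFF. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i++] - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* lone surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need + 1 > cap) /* keep one byte for the NUL */
            return (size_t)-1;

        if (need == 1) {
            out[o++] = (char)cp;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

Whether MATLAB itself treats surrogate pairs as single characters is a separate question; the sketch only shows that the raw 16-bit data can be converted without loss on the C side.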
UPDATE: MathWorks changed the wording to:
The mxArrayToString page states that it does handle multibyte encoded characters (unlike mxGetString, which only handles single-byte encoding schemes). Unfortunately, there is no example of how to do this.
Finally, here is a thread on the MATLAB newsgroup which mentions a couple of undocumented functions that are related to this (you can find those yourself by loading the libmx.dll library into a tool like Dependency Walker on Windows).
Here’s a small experiment I did in MEX:
my_func.c
I create three strings in C code encoded with ASCII, UTF-8, and UTF-16LE respectively. I then pass them to MATLAB using the mxCreateString MEX function (and other undocumented versions of it).
I got the byte sequences by consulting the Fileformat.info website:
A (U+0041), À (U+00C0), and 水 (U+6C34).
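For reference, these three code points can be written out as raw bytes in C as sketched below. The array names and the helper utf16le_to_units are made up for this illustration and are not taken from my_func.c. The helper only reassembles little-endian byte pairs into 16-bit code units, which shows why 水 appears as 34 6C in UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three characters as raw bytes; array names are hypothetical. */
const unsigned char text_utf8[] = {
    0x41,             /* U+0041 A  */
    0xC3, 0x80,       /* U+00C0 À  */
    0xE6, 0xB0, 0xB4, /* U+6C34 水 */
    0x00              /* NUL terminator */
};

const unsigned char text_utf16le[] = {
    0x41, 0x00,       /* U+0041 A  */
    0xC0, 0x00,       /* U+00C0 À  */
    0x34, 0x6C,       /* U+6C34 水, low byte first */
    0x00, 0x00        /* two-byte NUL terminator */
};

/* Assemble little-endian byte pairs into 16-bit code units, stopping at the
 * two-byte NUL terminator or after `cap` units.
 * Returns the number of units stored in `units`. */
size_t utf16le_to_units(const unsigned char *bytes, uint16_t *units, size_t cap)
{
    size_t n = 0;

    assert(bytes != NULL && units != NULL);
    while (n < cap && (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0)) {
        units[n] = (uint16_t)(bytes[2 * n] | (bytes[2 * n + 1] << 8));
        n++;
    }
    return n;
}
```

Calling utf16le_to_units(text_utf16le, units, 8) stores the three units 0x0041, 0x00C0 and 0x6C34, which match the code points listed above.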
Let’s test the above function inside MATLAB:
I am making use of the embedded Java capability to view the strings:
Now let’s work in the reverse direction (accepting a string from MATLAB into C):
my_func_reverse.c
And we test this from inside MATLAB:
Finally, I should say that if for some reason you are still having problems, the easiest thing would be to convert the non-ASCII strings to the uint8 datatype before passing them from MATLAB to your engine program.
So inside the MATLAB process do:
and access the variable using the Engine API as:
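A minimal sketch of the C side, assuming the variable is a uint8 vector holding UTF-8 bytes (for example the result of unicode2native(s, 'UTF-8') in MATLAB). The helper name bytes_to_cstring is hypothetical. In an Engine program, the data pointer would come from mxGetData on the array returned by engGetVariable, and the length from mxGetNumberOfElements; those calls are left out so the sketch compiles without the MATLAB headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n raw bytes into a freshly allocated,
 * NUL-terminated C string. The uint8 data is not NUL-terminated on the
 * MATLAB side, so the terminator has to be added here.
 * Returns NULL if allocation fails; the caller releases it with free(). */
char *bytes_to_cstring(const uint8_t *data, size_t n)
{
    char *s;

    assert(data != NULL || n == 0);
    s = malloc(n + 1);
    if (s == NULL)
        return NULL;
    if (n > 0)
        memcpy(s, data, n);
    s[n] = '\0';
    return s;
}
```

Since the bytes are copied verbatim, no character set conversion happens on the C side, which is the point of going through uint8.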
All tests were done on WinXP running R2012b with the default charset:
Hope this helps..
EDIT:
In MATLAB R2014a, many undocumented C functions were removed from the libmx library (including the ones used above), and replaced with equivalent C++ functions exposed under the namespace matrix::detail::noninlined::mx_array_api.
It should be easy to adjust the examples above (as explained here) to run on the latest R2014a version.