Please note, sarnold has heavily edited the question; the original
question, in its entirety, is kept in the question as a comment. If I made
something unclear, perhaps the original post will be helpful. (I’m leaving
it as a comment so future editors do not need to always refer to the
question edit history.)
I’m working with Delphi XE2 and need help understanding how to use ANSI
strings, Unicode strings, and wide-character strings correctly,
especially when writing a DLL intended for use with other languages (such
as VB, C++, or C#).
I need to write a DLL using Delphi XE2 to perform simple string operations
on Unicode strings. This DLL needs to work with either SimpleShareMem or
ShareMem, or without a shared memory manager. This DLL needs to be callable
from foreign languages such as VB, C++, and C#.
By default, strings should now be Unicode strings. Should we use
Embarcadero to work with these strings?
Strings are either: (a) single-byte characters that do not support Unicode
or (b) wide strings, where each character requires two bytes. (These do
support Unicode, but they are not UTF-8 strings.)
There are two pointer types available: PAnsiChar and PWideChar (there
is no PUnicodeChar pointer available). PChar is an alias for
PWideChar. Does this mean we always need to allocate 2 * length
bytes of memory for these strings? (And, similarly, do we need to divide
the memory size by 2 to get the length of these strings?)
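To make the sizing question concrete, here is a minimal sketch of how I understand lengths and byte counts relate in a Unicode-enabled Delphi. The program name is made up; it assumes the default string type is UnicodeString, so Char is a two-byte UTF-16 code unit.

```delphi
program CharSizeDemo;  // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;  // UnicodeString in Delphi 2009 and later
begin
  S := 'test';
  Writeln(SizeOf(Char));              // bytes per Char (one UTF-16 code unit)
  Writeln(Length(S));                 // length counted in Chars, not bytes
  Writeln(Length(S) * SizeOf(Char));  // payload bytes, excluding the null terminator
  Writeln(ByteLength(S));             // the same byte count via System.SysUtils.ByteLength
end.
```

If that is right, a buffer for a null-terminated PWideChar needs (Length + 1) * SizeOf(WideChar) bytes, and a byte count converts back to a length by dividing by SizeOf(WideChar).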
For string constants, do we need to mark the type of the string in the
source code? E.g.:
Const MyCo = 'test';
or
Const MyCo = WideString('test');
How about when we perform assignments between string variables?
s := st;
Should this be re-written:
s := WideString(st);
Should we include the Unicode Byte Order mark in our strings? How should
we include the BOM in our strings?
How should we work with ANSI strings in different Windows Code Pages? If
we receive an ANSI string with code page 1200, should we re-code the
string or work with it as-is?
How should we use the TEncoding class to convert between Unicode, UTF-8,
WideString, and AnsiString types?
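For reference, this is the kind of conversion I mean: a minimal sketch of a UTF-8 round trip through TEncoding, assuming only the System.SysUtils unit. The program and variable names are made up for illustration.

```delphi
program EncodingDemo;  // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S, RoundTrip: string;
  Utf8Bytes, Bom: TBytes;
begin
  S := 'test';
  Utf8Bytes := TEncoding.UTF8.GetBytes(S);           // UTF-16 string to UTF-8 bytes
  RoundTrip := TEncoding.UTF8.GetString(Utf8Bytes);  // UTF-8 bytes back to a string
  Bom := TEncoding.UTF8.GetPreamble;                 // the byte order mark for this encoding
  Writeln(Length(Utf8Bytes));  // number of UTF-8 bytes, not Chars
  Writeln(Length(Bom));        // number of BOM bytes
  Writeln(RoundTrip = S);      // whether the round trip preserved the text
end.
```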
Are there any severe performance penalties using wide strings or Unicode
strings?
Should we write our interfaces to require working with only the WideString
variants when using the general memory manager?
Should we write our interfaces to require length parameters for PChar,
PAnsiChar, and PWideChar parameter types?
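As an illustration of the length-parameter style, here is a minimal sketch of an export where the caller owns both buffers, so no shared memory manager should be needed. The library name, function name, and return-value convention are all assumptions made up for this example.

```delphi
library StrOps;  // hypothetical library name

uses
  System.SysUtils;

// Hypothetical export, not from any real library.
// Copies SrcLen Chars from Src, upper-cases them, and writes the result
// plus a null terminator into Dest when DestLen (in Chars) is large enough.
// Returns the number of Chars needed, excluding the terminator, or -1 on error.
function UpperCaseW(Src: PWideChar; SrcLen: Integer;
  Dest: PWideChar; DestLen: Integer): Integer; stdcall;
var
  S: string;
begin
  try
    SetString(S, Src, SrcLen);  // lengths are counted in Chars, not bytes
    S := AnsiUpperCase(S);
    Result := Length(S);
    if (Dest <> nil) and (DestLen > Result) then
      Move(PWideChar(S)^, Dest^, (Result + 1) * SizeOf(WideChar));
  except
    Result := -1;  // do not let an exception cross the DLL boundary
  end;
end;

exports
  UpperCaseW;

begin
end.
```

A caller in C++ or C# would pass a buffer it allocated itself, and could call once with Dest set to nil to learn the required size before allocating.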
How do we write our interfaces to determine if a file is stored in Unicode,
UTF-8, ANSI, or Wide Characters? How should we determine what format to
use when writing files back out?
Should we use only procedures? Or can functions work too?
Thanks, and happy new year.
I get the impression that Gu is moving from Delphi 7 to a Unicode-enabled version (D2009+) and is looking for advice on how to deal with the new strings.
Cary Jensen’s white paper "Delphi Unicode Migration for Mere Mortals" addresses most, if not all, of the issues raised in the question.
I would normally have put this in a comment, but the list of comments is already so long that I felt the link (which may help more people than just Gu) would be easier to find in an answer.