I need some help understanding the concept of a well-formed UTF-16 string, as mentioned in these two paragraphs from Chapter 2 (General Structure), Section 2.7 (Unicode Strings) of the Unicode Standard:
“Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16—that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.
Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed when that string is specified to be well formed UTF-16.”
The paragraph explains this for UTF-16: a string is not well-formed if it contains isolated surrogate code units.
That is, certain code units are valid only when they appear in pairs. A code unit in the range [0xD800-0xDFFF] is a surrogate and must occur only as part of a pair: the first (the high surrogate) must be in the range [0xD800-0xDBFF], and it must be immediately followed by a second (the low surrogate) in the range [0xDC00-0xDFFF]. If a string does not obey this requirement, it is not well-formed UTF-16.
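To make the pairing rule concrete, here is a minimal sketch in TypeScript, where strings are sequences of 16-bit code units just like the ECMAScript strings the quoted text mentions. The function name `isWellFormedUtf16` is a hypothetical helper made up for illustration; it is not taken from the Unicode Standard or from any library.

```typescript
// Hypothetical helper (name chosen for illustration): returns true when every
// surrogate code unit in `s` belongs to a high-surrogate + low-surrogate pair.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the very next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) {
        return false; // isolated high surrogate
      }
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}

console.log(isWellFormedUtf16("abc"));          // true: no surrogates at all
console.log(isWellFormedUtf16("\uD83D\uDE00")); // true: a proper surrogate pair
console.log(isWellFormedUtf16("\uD83D"));       // false: isolated high surrogate
console.log(isWellFormedUtf16("\uDE00\uD83D")); // false: pair in the wrong order
```

All four strings are perfectly legal as "Unicode 16-bit strings" in the sense of the quoted text; only the first two are also well-formed UTF-16. Recent ECMAScript versions also offer a built-in `String.prototype.isWellFormed()` for the same check, if your runtime supports it.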