I have a twelve-year-old Windows program. As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there’s one spot that still needs to be changed over. There is a serious constraint on it though: the exact same ASCII byte sequence MUST be created by different encoders, some of which will be operating on non-Windows systems.
I’m trying to determine whether UTF-8 will do the trick or not. I’ve heard in passing that different UTF-8 sequences can come up with the same Unicode string, which would be a problem here.
So the question is: given a Unicode string, can I expect a single canonical UTF-8 sequence to be generated by any standards-conforming implementation of a converter? Or are there multiple possibilities?
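Any given Unicode string, meaning any given sequence of code points, has exactly one representation in UTF-8. The standard requires the shortest form for each code point; longer ("overlong") encodings of the same code point are ill-formed, so a conforming encoder never produces them.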
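I think the confusion here is that Unicode often has multiple code-point sequences that produce the same visual output. For example, é can be the single code point U+00E9, or the letter e followed by the combining acute accent U+0301. Not to mention that Unicode has several characters that have no visual representation at all.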
But this has nothing to do with UTF-8; it’s a property of Unicode itself. Encoding a given sequence of Unicode code points as UTF-8 is a purely mechanical process, and it’s perfectly reversible.
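To illustrate the distinction, here is a minimal Python sketch. It shows that two strings that look identical on screen but contain different code points encode to different bytes, while each one round-trips through UTF-8 unchanged. The helper name `canonical_utf8` is hypothetical, and normalizing to NFC before encoding is an assumption about what the encoders involved could agree on, not something UTF-8 itself requires.

```python
import unicodedata

# Two strings that render identically but hold different code points.
composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Each code-point sequence has exactly one UTF-8 encoding.
composed_bytes = composed.encode("utf-8")      # b'\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")  # b'e\xcc\x81'


def canonical_utf8(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC first, then encode.

    Assumes every encoder involved agrees on NFC as the normal form.
    """
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

If every encoder applies the same normalization form before encoding, visually identical input should yield identical bytes; without that step, the bytes differ only because the code points differ.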
The conversion rules are here:
http://en.wikipedia.org/wiki/UTF-8
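As a rough sketch of how mechanical those rules are, the following Python encodes one code point at a time using only the bit layout for 1- to 4-byte sequences. The function names `encode_code_point` and `encode_utf8` are made up for illustration; in practice you would use your platform's built-in converter.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (shortest form only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Out-of-range values and surrogates have no valid UTF-8 encoding.
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_utf8(text: str) -> bytes:
    """Encode a whole string by concatenating per-code-point encodings."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)
```

Each range of code points maps to exactly one branch, which is why two conforming encoders given the same code points cannot disagree on the output bytes.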