trace(escape("д"));
will print “%D0%B4”, the correct URL encoding for this character (Cyrillic equivalent of “D”).
However, if I were to do..
myTextArea.htmlText += unescape("%D0%B4");
What gets printed is:
Ð´
which is of course incorrect. Simply tracing the above unescape returns the correct Cyrillic character, though! For this textarea, escaping “д” returns its unicode code-point “%u0434”.
I’m not sure what exactly is happening to mess this up, but…
UTF-16 Ð´ in web encoding is: %FE%FF%00%D0%00%B4
Whereas
UTF-16 д in web encoding is: %00%D0%00%B4
So it’s padding this value with something at the beginning. Why would a trace provide different text than a print to an (empty) textarea? What’s goin’ on?
The textarea in question has no weird encoding properties attached to it, if that sort of thing is even possible.
The problem is unescape (escape could also be a problem, but it’s not the culprit in this case). These functions are not multibyte aware. What escape does is this: it takes a byte in the input string and returns its hex representation with a % prepended. unescape does the opposite. The key point here is that they work with bytes, not characters.
What you want is encodeURIComponent / decodeURIComponent. Both use utf-8 as the string encoding scheme (the encoding used by flash everywhere). Note that it’s not utf-16 (which you shouldn’t care about as far as flash is concerned).
Now, if you want to dig a bit deeper, here’s what’s going on (this assumes a basic knowledge of how utf-8 works).
trace(escape("д"));
This returns
%D0%B4
Why?
“д” is treated by flash as utf-8. The codepoint for this character is 0x0434.
In binary:
0000 0100 0011 0100
It fits in two utf-8 bytes, so it’s encoded thus (where e means encoding bit, and p means payload bit):
1101 0000 1011 0100
eeep pppp eepp pppp
Converting it to hex, we get:
0xd0 0xb4
So, 0xd0,0xb4 is a utf-8 encoded “д”.
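The bit shuffling above can be checked with a few lines of code. This is only a sketch, and it is written in TypeScript rather than ActionScript; the helper name utf8TwoByte is made up for illustration and only handles code points that need exactly two utf-8 bytes (U+0080 to U+07FF).

```typescript
// Hypothetical helper: encode a code point in U+0080..U+07FF as two UTF-8 bytes.
function utf8TwoByte(codePoint: number): number[] {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError("not a two-byte UTF-8 code point");
  }
  const lead = 0xc0 | (codePoint >> 6);    // 110ppppp: top 5 payload bits
  const trail = 0x80 | (codePoint & 0x3f); // 10pppppp: low 6 payload bits
  return [lead, trail];
}

const bytes = utf8TwoByte("д".charCodeAt(0)); // code point 0x0434
console.log(bytes.map((b) => b.toString(16)).join(" ")); // d0 b4
```

The lead byte carries the 110 marker plus the top five payload bits, the trail byte carries the 10 marker plus the remaining six.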
These two bytes are fed to escape. escape sees two bytes, and gives you:
%D0%B4
Now, you pass this to unescape. But unescape is a little bit brain-dead, so it thinks one byte is one and the same thing as one char, always. As far as unescape is concerned, you have two bytes, hence, you have two chars. If you look up the code-points for 0xd0 and 0xb4, you’ll see this:
0xd0 -> Ð
0xb4 -> ´
So, unescape returns a string consisting of two chars, Ð and ´ (instead of figuring out that the two bytes it got were actually just one char, utf-8 encoded). Then, when you assign the text property, you are not really passing д but Ð´, and this is what you see in the text area.
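To see the difference side by side, here is a rough model of the two decoding strategies. It is TypeScript, not Flash code, and naiveUnescape is a made-up function that only imitates the one-byte-equals-one-char behaviour described above; it is not how Flash actually implements unescape. encodeURIComponent and decodeURIComponent are the standard ECMAScript functions, which I am assuming behave the same way as their ActionScript namesakes for this input.

```typescript
// Hypothetical byte-wise decoder: every %XX escape becomes exactly one character.
function naiveUnescape(s: string): string {
  return s.replace(/%([0-9A-Fa-f]{2})/g, (_match: string, hexPair: string) =>
    String.fromCharCode(parseInt(hexPair, 16))
  );
}

const escaped = encodeURIComponent("д");  // "%D0%B4"
console.log(naiveUnescape(escaped));       // Ð´ (two chars, one per byte)
console.log(decodeURIComponent(escaped));  // д  (bytes decoded as utf-8)
```

So in the original snippet, appending decodeURIComponent("%D0%B4") to htmlText instead of the unescape result should give the expected character.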