The usual method of URL-encoding a unicode character is to split it into 2

Question

Editorial Team

Asked: May 10, 2026

The usual method of URL-encoding a Unicode character is to split it into two %HH codes (\u4161 => %41%61).

But how is Unicode distinguished when decoding? How do you know whether %41%61 is \u4161 or \x41\x61 (‘Aa’)?

Are 8-bit characters that require encoding preceded by %00?

Or, is the point that unicode characters are supposed to be lost/split?

1 Answer

score 0 · Answer 1 · 2026-05-10T17:41:58+00:00

According to Wikipedia:

Current standard

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.
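As a rough illustration of the UTF-8 rule quoted above, here is a minimal Python sketch using the standard library's urllib.parse. It assumes the decoder also follows the RFC 3986 convention; the variable names are only for illustration. Under that convention the character from the question does not become %41%61 at all, so the ambiguity asked about does not arise.

```python
from urllib.parse import quote, unquote

ch = "\u4161"  # the character from the question

# quote() converts non-ASCII text to UTF-8 by default,
# then percent-encodes each resulting byte.
encoded = quote(ch)

# unquote() reverses this: percent-decode to bytes, then read them as UTF-8.
decoded = unquote(encoded)

# %41%61 is just the two ASCII bytes 0x41 and 0x61.
ascii_pair = unquote("%41%61")

print(encoded)        # %E4%85%A1
print(decoded == ch)  # True
print(ascii_pair)     # Aa
```

U+4161 needs three UTF-8 bytes (E4 85 A1), so it is sent as three %HH codes rather than two.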

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
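If a decoder has to accept the non-standard %uxxxx form alongside ordinary %HH codes, one possible approach is sketched below in Python. The function name decode_with_u_escapes is hypothetical, and the choice to treat plain %HH runs as UTF-8 is an assumption; a real application would need to know what its clients actually send.

```python
import re
from urllib.parse import unquote

# One pattern, one pass: either a %uXXXX escape or a run of standard %HH escapes.
# A single pass avoids decoding the output of one escape a second time.
_ESCAPES = re.compile(r"%u([0-9A-Fa-f]{4})|((?:%[0-9A-Fa-f]{2})+)")

def decode_with_u_escapes(text: str) -> str:
    """Hypothetical helper: decode %uXXXX and UTF-8 %HH escapes in one pass."""
    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            # %uXXXX names a 16-bit value directly.
            # Limitation: surrogate pairs are not combined in this sketch.
            return chr(int(match.group(1), 16))
        # A run of %HH codes is assumed to be UTF-8 encoded bytes.
        return unquote(match.group(2))
    return _ESCAPES.sub(replace, text)

print(decode_with_u_escapes("%u4161") == "\u4161")     # True
print(decode_with_u_escapes("%E4%85%A1") == "\u4161")  # True
print(decode_with_u_escapes("%41%61"))                 # Aa
```

Because the two forms are syntactically distinct (%u versus % followed by two hex digits), a decoder can tell them apart without any %00 prefix.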

So it looks like it’s entirely up to the person writing the decoding method… Aren’t standards fun?
