Is there any authoritative reference about the syntax and encoding of a URL for the pseudo-protocol javascript:? (I know it’s not very well regarded, but it’s still useful for bookmarklets.)
First, we know that standard URLs follow the syntax:
scheme://username:password@domain:port/path?query_string#anchor
but this format doesn’t seem to apply here. Indeed, it seems it would be more correct to speak of a URI instead of a URL: here is listed the “unofficial” format javascript:{body}.
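To make the difference concrete, here is a small sketch using the WHATWG URL class (available in browsers and in Node.js). The URLs, host name and credentials are placeholders I made up for illustration; the point is only that a javascript: URL has no authority, query or hierarchy, just a scheme followed by an opaque body.

```javascript
// Illustrative URLs only: example.com, user and secret are placeholders.
const hierarchical = new URL("http://user:secret@example.com:8080/path?q=1#top");

// A hierarchical URL splits into the familiar components.
const parts = {
  scheme: hierarchical.protocol,   // "http:"
  username: hierarchical.username, // "user"
  password: hierarchical.password, // "secret"
  domain: hierarchical.hostname,   // "example.com"
  port: hierarchical.port,         // "8080"
  path: hierarchical.pathname,     // "/path"
  query: hierarchical.search,      // "?q=1"
  anchor: hierarchical.hash,       // "#top"
};

// A javascript: URL only has a scheme and an opaque body.
const bookmarklet = new URL("javascript:alert(1)");
const opaque = {
  scheme: bookmarklet.protocol, // "javascript:"
  host: bookmarklet.host,       // "" (there is no authority part)
  body: bookmarklet.pathname,   // "alert(1)"
};
```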
Now, then, which are the valid characters for such a URI (what are the escape/unescape rules) when embedding it in HTML?
Specifically, if I have the code of a JavaScript function and I want to embed it in a javascript: URI, which are the escape rules to apply?
Of course one could escape every non-alphanumeric character, but that would be overkill and make the code unreadable. I want to escape only the necessary characters.
Further, it’s clear that it would be bad to use some urlencode/urldecode routine pair (those are for query string values): we don’t want ‘+’ decoded to a space, for example.
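A quick sketch of what I mean, using only built-ins (the sample strings and the parameter name v are arbitrary): plain percent-decoding leaves ‘+’ alone, while form/query-string decoding turns it into a space.

```javascript
// Sample input: a literal '+' plus an escaped space.
const encoded = "a+b%20c";

// Plain percent-decoding: only %xx sequences are touched.
const uriDecoded = decodeURIComponent(encoded); // "a+b c"

// Query-string (form) decoding: '+' is also turned into a space.
const formDecoded = new URLSearchParams("v=" + encoded).get("v"); // "a b c"

// The same difference shows up when encoding.
const uriEncoded = encodeURIComponent("1 + 1");                   // "1%20%2B%201"
const formEncoded = new URLSearchParams({ v: "1 + 1" }).toString(); // "v=1+%2B+1"
```

So for JavaScript code such as a+b, a form-style decoder would silently change the program.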
My findings, so far:
First, there are the rules for writing a valid HTML attribute value: here the standard only requires (if the attribute value is enclosed in quotes) arbitrary CDATA (actually a %URI, but HTML itself does not impose additional validation at its level: any CDATA will validate).
Some examples:
Example (1) is valid. But example (2) is also valid HTML 4.01 Strict. To make it valid XHTML we only need to escape the XML special characters < > & (example 3 is valid XHTML 1.0 Strict).
Now, is example (2) a valid javascript: URI? I’m not sure, but I’d say it’s not.
From RFC 2396: a URI is subject to some additional restrictions and, in particular, to escaping/unescaping via %xx sequences. And some characters are always prohibited: among them spaces and {}#.
The RFC also defines a subset of opaque URIs: those that do not have hierarchical components, and for which the separating characters have no special meaning (for example, they don’t have a ‘query string’, so the ? can be used like any non-special character). I assume javascript: URIs should be considered among them.
This would imply that the valid characters inside the ‘body’ of a javascript: URI are the RFC’s reserved and unreserved characters (letters, digits and ; / ? : @ & = + $ , - _ . ! ~ * ' ( )), with the additional restriction that it can’t begin with /.
This still leaves out some “important” ASCII characters, for example braces and brackets ({ } [ ]). Also % (because it’s used for escape sequences), double quotes (") and (most important) all blanks.
In some respects, this seems quite permissive: it’s important to note that + is valid (and hence it should not be ‘unescaped’ to a space when decoding).
But in other respects, it seems too restrictive. Braces and brackets, especially: I understand that they are normally used unescaped and browsers have no problems.
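To check my reading of the RFC, here is a minimal sketch of a validator for the opaque part as RFC 2396 defines it (a first character that is not /, then any mix of reserved characters, unreserved characters and %xx escapes). The function names are my own, and this only reflects the RFC grammar, not what browsers actually accept.

```javascript
// One "uric" per RFC 2396: reserved | unreserved | escaped (%XX).
const URIC = "(?:[A-Za-z0-9;/?:@&=+$,\\-_.!~*'()]|%[0-9A-Fa-f]{2})";

// opaque_part = uric_no_slash *uric, i.e. one or more uric, not starting with "/".
const OPAQUE_PART = new RegExp("^(?!/)" + URIC + "+$");

// True if the body is a valid opaque part under RFC 2396 (hypothetical helper).
function isValidOpaqueBody(body) {
  return OPAQUE_PART.test(body);
}

// Lists the distinct characters the RFC grammar would reject (hypothetical helper).
function disallowedChars(body) {
  const rest = body.replace(new RegExp(URIC, "g"), "");
  return [...new Set(rest)];
}

const ok = isValidOpaqueBody("alert('Hello');");              // true
const withSpace = isValidOpaqueBody("alert('a b')");          // false
const offenders = disallowedChars('if(x){alert("a b")}');     // ["{", '"', " ", "}"]
```

Under this grammar + passes untouched, while braces, double quotes, blanks and a bare % are all rejected, which matches the lists above.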
And what about spaces? Like braces, they are disallowed by the RFC, but I see no problem in this kind of URI. However, I see that in most bookmarklets they are escaped as “%20”. Is there any (empirical or theoretical) explanation for this?
I still don’t know if there are standard functions to do this escaping/unescaping (in mainstream languages), or some sample code.
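The closest thing I can put together is the following sketch, which is an assumption on my part and not taken from any authoritative source. JavaScript’s built-in encodeURI leaves unescaped exactly the RFC 2396 reserved and unreserved characters plus #, and percent-encodes everything else as UTF-8 (including % itself), so fixing up # and a leading / gives a body that satisfies the RFC. A second, separate step escapes the result for a double-quoted HTML attribute. Both function names are hypothetical.

```javascript
// Step 1: JavaScript source -> javascript: URI (hypothetical helper).
function toJavascriptUri(code) {
  // encodeURI keeps A-Z a-z 0-9 ; / ? : @ & = + $ , - _ . ! ~ * ' ( ) and #,
  // so only '#' still needs escaping to match the RFC 2396 character set.
  let body = encodeURI(code).replace(/#/g, "%23");
  // An opaque part must not begin with '/'.
  if (body.startsWith("/")) {
    body = "%2F" + body.slice(1);
  }
  return "javascript:" + body;
}

// Step 2: URI -> value for a double-quoted HTML/XHTML attribute (hypothetical helper).
function escapeHtmlAttribute(value) {
  return value
    .replace(/&/g, "&amp;") // must come first, or the other entities get re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

const uri = toJavascriptUri('if(a<b){f("x")}#');
// "javascript:if(a%3Cb)%7Bf(%22x%22)%7D%23"

const href = escapeHtmlAttribute(toJavascriptUri("a&&alert('a b');"));
// "javascript:a&amp;&amp;alert('a%20b');"
```

Decoding with decodeURIComponent reverses step 1 without touching +. Note that ' survives both steps unescaped, so this only works inside a double-quoted attribute.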