I have a form that is accepting URLs from users in PHP.
What characters should I allow or disallow? Currently I use
$input = preg_replace('~[^a-zA-Z0-9\-_?:#.(),/&\'"]~', '', $string);
$input = substr($input, 0, 255);
So, it’s trimmed to 255 chars and can only include letters, numbers, and ? - _ : # . ( ) , & ' " /
Anything I should be stripping that I’m not, or anything I’m stripping that might need to be in a valid URL?
RFC 1738, which defines the URL specification, restricts which characters may be used within a URL scheme, and which characters may be used unencoded within the scheme-specific part of a URL. (The reserved characters ;/?:@=&, if used unencoded, must be used for their ‘reserved purposes’, but if you’re just checking for invalid characters you don’t need to worry about that.) So if you want full generality, I’d check the URL against a regex built from those two character sets. If you’re only looking for HTTP URLs, (some of) the other answers should be fine.
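A minimal sketch of such a check in PHP, not the regex from the original answer. The character sets are taken from my reading of RFC 1738: section 2.1 allows letters, digits, "+", "." and "-" in the scheme, and section 2.2 allows alphanumerics, the specials $-_.+!*'(), the reserved characters ;/?:@=& and %XX escapes elsewhere. The function name is_valid_rfc1738_chars is made up for illustration, and the check only looks at which characters appear, not at whether the URL is well-formed for its scheme.

```php
<?php
// Hypothetical helper: checks characters only, per RFC 1738 sections 2.1 and 2.2.
// It does not verify that the URL is structurally valid for its scheme.
function is_valid_rfc1738_chars(string $url): bool
{
    // Scheme: letters, digits, "+", "." and "-".
    $scheme = '[a-zA-Z0-9+.\-]+';

    // Scheme-specific part: alphanumerics, the specials $-_.+!*'(),
    // the reserved characters ;/?:@=& and percent-escapes such as %20.
    $rest = '(?:[a-zA-Z0-9$\-_.+!*\'(),;/?:@=&]|%[0-9a-fA-F]{2})*';

    // "~" is the delimiter so "/" needs no escaping; the D modifier makes
    // "$" match only at the very end of the string.
    return preg_match('~^' . $scheme . ':' . $rest . '$~D', $url) === 1;
}
```

Note that this rejects input instead of silently stripping characters, and that "#" fails the check: RFC 1738 treats it as the delimiter for a fragment that follows the URL, not as part of the URL itself. If you need to accept fragments, you would have to allow "#" explicitly.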