I’ve looked all over and have yet to find a single solution to address my need for a regular expression pattern that will match a generic URL. I need to support multiple protocols (with verification), localhost and/or IP addressing, ports and query strings. Some examples:
Ideally, I’d like the pattern to also support extracting the various elements (protocol, host, port, query string, etc.) but this is not a requirement.
(Also, for the purposes of myself and future readers, if you could explain the pattern, it would be helpful.)
Nicholas Carey is correct to steer you towards RFC-3986. The regex he points out will match a generic URI, but it will not validate it (and this regex is not good for picking URLs out of “the wild” – it is too loose and matches just about any string including an empty string).
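To make the "too loose" point concrete, here is a minimal Python sketch using the parsing regex given in RFC 3986, Appendix B. The helper name `split_uri` is just an illustration, not something from the RFC. The regex splits a string into the five generic components, but since every component is optional it accepts anything, including the empty string:

```python
import re

# Regex from RFC 3986, Appendix B. It parses a URI reference into components
# but does not validate it.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(text):
    """Split text into scheme, authority, path, query and fragment.

    Hypothetical helper: it never rejects input, because the pattern
    can always match (possibly with every optional group absent).
    """
    m = RFC3986_APPENDIX_B.match(text)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_uri("http://localhost:8080/index.php?q=1#top"))
print(split_uri(""))           # still matches: every component is optional
print(split_uri("not a url"))  # matches too, as a bare path
```

This is handy for pulling a known-good URI apart, but it is no use as a validator or for finding URLs in free text.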
Regarding the validation requirement, you may want to take a look at an article I wrote on the subject, which takes from Appendix A all the ABNF syntax definitions of all the various components and provides regex equivalents:
Regular Expression URI Validation
Regarding the subject of picking out URLs from the “wild”, take a look at Jeff Atwood’s “The Problem With URLs” and John Gruber’s “An Improved Liberal, Accurate Regex Pattern for Matching URLs” blog posts to get a glimpse of some of the subtle problems which can arise. Also, you may want to take a look at a project I started last year, URL Linkification, which picks out unlinked HTTP and FTP URLs from text which may already have some links.
That said, the following is a PHP function which uses a slightly modified version of the RFC-3986 “Absolute URI” regex to validate HTTP and FTP URLs (with this regex, the named host portion must not be empty). All the various components of the URI are isolated and captured into named groups, which allows for easy manipulation and validation of the parts within the program code:
The first regex validates the string as an absolute generic URI (that is, one with a non-empty host portion). A second regex is then used to validate the named host portion, when it is not an IP literal or IPv4 address, against the rules of the DNS lookup system: each dot-separated subdomain must be 63 characters or fewer, consisting of digits, letters and dashes, and the overall length must be less than 255 characters.
Note that the structure of this function allows easy expansion to include other schemes.
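As a rough illustration of the two-step approach, here is a minimal Python sketch. It is not the PHP function itself, and the absolute-URI pattern is a deliberately simplified stand-in for the full RFC-3986 grammar. All names (`ALLOWED_SCHEMES`, `ABSOLUTE_URI`, `validate_url`, and so on) are hypothetical, and the `https` entry in the scheme set is my own assumption:

```python
import re

# Hypothetical scheme whitelist; add entries here to accept other schemes.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

# Step 1: simplified absolute-URI pattern with named groups.
# Assumption: this is NOT the full RFC 3986 grammar, only an approximation.
ABSOLUTE_URI = re.compile(
    r"(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?:(?P<userinfo>[^@/?#\s]*)@)?"
    r"(?P<host>\[[0-9A-Fa-f:.]+\]|[^:/?#@\s\[\]]+)"  # IP literal or non-empty name
    r"(?::(?P<port>[0-9]*))?"
    r"(?P<path>/[^?#\s]*)?"
    r"(?:\?(?P<query>[^#\s]*))?"
    r"(?:#(?P<fragment>\S*))?"
)

# Dotted-quad IPv4 address, each octet 0-255.
_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4 = re.compile(r"(?:%s\.){3}%s" % (_OCTET, _OCTET))

# Step 2: DNS-style host name, each label 1-63 letters, digits or dashes.
DNS_NAME = re.compile(r"[A-Za-z0-9-]{1,63}(?:\.[A-Za-z0-9-]{1,63})*")

def validate_url(url):
    """Return a dict of the URL's components, or None if a check fails."""
    m = ABSOLUTE_URI.fullmatch(url)
    if m is None:
        return None
    parts = m.groupdict()
    if parts["scheme"].lower() not in ALLOWED_SCHEMES:
        return None
    host = parts["host"]
    is_ip = host.startswith("[") or IPV4.fullmatch(host) is not None
    # Named hosts must also satisfy the DNS length and character rules.
    if not is_ip and not (len(host) < 255 and DNS_NAME.fullmatch(host)):
        return None
    return parts

print(validate_url("http://localhost:8080/index.php?q=1"))
print(validate_url("http://" + "a" * 64 + ".com/"))  # label too long
```

Keeping the scheme check separate from the regex is what makes expansion cheap in this sketch: supporting another scheme is a one-line change to the set, and scheme-specific rules can be added as extra checks on the captured groups.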