I’m using phpBB3 to make a message board. There is a built-in feature that takes all URLs in posts and renders them as links. I want to make it so that ONLY local links are made clickable.
phpBB3 runs a regex over the text of a post and changes each match into a link:
if ($somestuff){
// matches a xxxx://aaaaa.bbb.cccc. ...
$magic_url_match[] = '#(^|[\n\t (>.])(' . "[a-z]$scheme*:/{2}(?:(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})+|[0-9.]+|\[[a-z0-9.]+:[a-z0-9.]+:[a-z0-9.:]+\])(?::\d*)?(?:/(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})*)*(?:\?(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?(?:\#(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?" . ')#ie';
$magic_url_replace[] = "make_clickable_callback(MAGIC_URL_FULL, '\$1', '\$2', '', '$class')";
// matches a "www.xxxx.yyyy[/zzzz]" kinda lazy URL thing
$magic_url_match[] = '#(^|[\n\t (>])(' . "www\.(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})+(?::\d*)?(?:/(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})*)*(?:\?(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?(?:\#(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?" . ')#ie';
$magic_url_replace[] = "make_clickable_callback(MAGIC_URL_WWW, '\$1', '\$2', '', '$class')";
}
return preg_replace($magic_url_match, $magic_url_replace, $text);
How can I rewrite these regexes so that they only match links on my domain? Additionally, what is the best way to teach myself regex?
This is the first one, broken up section by section. Even doing this was non-trivial…
OK, the opening (^|[\n\t (>.]) simply means "beginning of the text, or after a newline, tab, space, opening parenthesis, greater-than sign, or period". It just anchors the regex.
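To see what that leading group does, it can be tried on its own. This is only a sketch: the URL part is replaced by a simplified stand-in (https?://\S+), which is my assumption and not what phpBB uses.

```php
<?php
// Leading group copied from the first pattern. "https?://\S+" is a simplified
// stand-in for the real URL part, used only to keep the example short.
$pattern = '#(^|[\n\t (>.])(https?://\S+)#i';

$samples = array(
    'http://example.com/a',       // URL at the very start of the text
    'see http://example.com/a',   // URL after a space
    '(http://example.com/a',      // URL after an opening parenthesis
    'xhttp://example.com/a',      // URL glued to a preceding letter
);

foreach ($samples as $sample) {
    $matched = preg_match($pattern, $sample) === 1;
    echo str_pad($sample, 28), $matched ? 'match' : 'no match', "\n";
}
```

Only the last sample should be rejected, because a letter is not one of the characters allowed in front of the URL.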
This is pure insanity right here.
$scheme presumably holds a character class for the rest of the scheme name: with the [a-z] in front of it and the * behind it, a literal http would not fit. So [a-z]$scheme*:/{2} matches the http:// (or any other scheme). Why someone would use /{2} instead of //, I cannot begin to guess.
The next part, (?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})+, matches a series of characters, presumably those that are legal in a URL. Of note is the $inline PHP variable (can’t guess what that holds) and the second alternative, %[\dA-F]{2}. That matches things like %20 for a space, etc. The % sign is not otherwise legal in the match (or in a URL).
Also important here is that / is not legal. This, therefore, cannot refer to directories, only to the domain. This is most likely the part you want to change, to simply match the appropriate domain of your website.
For completeness’s sake, though, here’s the rest.
Alternatively, we could have a series of digits and periods, [0-9.]+, which is an IP address. Considering how complicated this regex is, I’m surprised he didn’t go for (?:\d{1,3}\.){3}\d{1,3}…
Here’s our last alternative, \[[a-z0-9.]+:[a-z0-9.]+:[a-z0-9.:]+\]; I think this is for IPv6. It’s a series of hexadecimal numbers separated by colons, anyway. It requires that these be within square brackets, which I find odd, especially for a forum software that uses those so heavily for tags…
Here, we get the option of some digits following a colon. That is, this is for URLs that have a port in them.
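The host alternatives and the optional port can be checked in isolation with something like the following. It is a sketch under one assumption: $inline is taken to be empty, so it is simply dropped from the character class.

```php
<?php
// The three host alternatives and the optional port, copied from the first
// pattern. ASSUMPTION: $inline is empty, so it is left out of the class.
$chars = '(?:[a-z0-9\-._~!$&\'(*+,;=:@|]+|%[\dA-F]{2})+';
$ipv4  = '[0-9.]+';
$ipv6  = '\[[a-z0-9.]+:[a-z0-9.]+:[a-z0-9.:]+\]';
$port  = '(?::\d*)?';

// Anchor with ^...$ so that each sample has to be matched in full.
$host = '#^(?:' . $chars . '|' . $ipv4 . '|' . $ipv6 . ')' . $port . '$#i';

$samples = array(
    'example.com',
    '192.168.0.1',
    '[2001:db8::1]',
    '[2001:db8::1]:8080',
    'example.com/forum',   // contains "/", which no host alternative allows
);

foreach ($samples as $sample) {
    $matched = preg_match($host, $sample) === 1;
    echo str_pad($sample, 22), $matched ? 'match' : 'no match', "\n";
}
```

Note that the first alternative already allows digits, periods and colons, so it also covers plain IPv4 addresses and a trailing port by itself.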
OK, here we’ve gotten to the subdirectories, as shown by the / at the beginning. Otherwise, this is the same "legal URL characters" match.
Finally, things that are being passed by GET, indicated by the \?, and URLs linking to a mid-page anchor, indicated by the \#.
Bottom line:
This section:
Should be replaced with something like this:
Or maybe
Where the domain and the IP addresses match your website. Obviously, you’re going to have to remove the line breaks and indentation I did. I’d do it for you, but I think it’s almost not worth it because you’ll have a hard time finding the spot where you put your domain in the middle of all that.
You’ll probably want to include some regex for subdomains or people leaving out the www. or what have you.
You may also want to remove the port part, (?::\d*)?, as you probably don’t want people linking to other ports on your domain.
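Put together, a domain-only version of the first pattern might look roughly like this. Treat it as a sketch, not as phpBB's code: example.com, the function name linkify_local and the lookahead after the host are all my own assumptions, $inline is assumed empty, and the scheme is narrowed to http/https. It uses preg_replace_callback because the /e modifier from the original was removed in PHP 7; inside phpBB you would keep calling make_clickable_callback instead of building the link by hand.

```php
<?php
// ASSUMPTIONS: the board lives at example.com (hypothetical), $inline is
// empty, and only http/https links should become clickable.
function linkify_local($text)
{
    $chars = '[a-z0-9\-._~!$&\'(*+,;=:@|]';
    $tail  = '[a-z0-9\-._~!$&\'(*+,;=:@/?|]';

    // Bare domain, www. or any other subdomain.
    $host = '(?:[a-z0-9\-]+\.)*example\.com';
    // Reject a port, userinfo, or a longer host name such as example.com.evil.org.
    $boundary = '(?![a-z0-9\-@:]|\.[a-z0-9])';

    // Path, query and fragment parts kept as in the original pattern.
    $path  = '(?:/(?:' . $chars . '+|%[\dA-F]{2})*)*';
    $query = '(?:\?(?:' . $tail . '+|%[\dA-F]{2})*)?';
    $frag  = '(?:\#(?:' . $tail . '+|%[\dA-F]{2})*)?';

    $pattern = '#(^|[\n\t (>.])(https?://' . $host . $boundary . $path . $query . $frag . ')#i';

    return preg_replace_callback($pattern, function ($m) {
        $url = htmlspecialchars($m[2], ENT_QUOTES);
        // $m[1] is the character in front of the URL and has to be put back.
        return $m[1] . '<a href="' . $url . '">' . $url . '</a>';
    }, $text);
}

echo linkify_local('see http://example.com/viewtopic.php?t=1 and http://other.org/x'), "\n";
```

The lookahead is there so that a host which merely starts with your domain is left alone entirely, instead of having its first half turned into a link.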
The second one looks to have roughly the same structure; as the comment says, it’s just getting URLs that lack the protocol designator.
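The same idea applied to the scheme-less form could be sketched as follows. Again, example.com and linkify_local_www are hypothetical names, the tail of the URL is simplified to "a slash plus any non-space characters", and prepending http:// in the href is my assumption about what you would want for such links.

```php
<?php
// ASSUMPTIONS: example.com is a hypothetical domain, and the tail of the URL
// is simplified to "a slash followed by any run of non-space characters".
function linkify_local_www($text)
{
    // The leading group has no "." in it, as in the second original pattern.
    $pattern = '#(^|[\n\t (>])(www\.example\.com(?![a-z0-9\-@:]|\.[a-z0-9])(?:/\S*)?)#i';

    return preg_replace_callback($pattern, function ($m) {
        $shown = htmlspecialchars($m[2], ENT_QUOTES);
        // No scheme was typed, so one is added for the href only.
        return $m[1] . '<a href="http://' . $shown . '">' . $shown . '</a>';
    }, $text);
}

echo linkify_local_www('go to www.example.com/faq or www.other.org'), "\n";
```

Because "/" is not among the characters allowed in front of www., a full http://www.example.com URL is not picked up a second time by this pattern.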