I receive text formatted as HTML. I want to restrict the URLs in anchor tags to my own domain, replacing any other links with “xxx” (or something else).
Input: “<a href='otherdomain'>text</a>”
Output: “xxx”
I am using a regexp to achieve this, but I’m kind of stuck here:
$pattern ='/<a.*href=[\'|\"]http.?:\/\/[^mydomain.*\"\']*[\'|\"].*<\/a>/i';
$replace ='xxx';
echo preg_replace($pattern, $replace, $string);
What is wrong here?
When you do `[^mydomain.*\"\']` you are saying “match any single character except a literal ‘m’, ‘y’, ‘d’, ‘o’, …, ‘.’, ‘*’, etc.” Square brackets define a character class, not a word to exclude. Try something like:
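A minimal sketch of such a pattern, assuming `mydomain` is a placeholder for your real host name. It is written as a nowdoc so that nothing needs PHP-level escaping, it uses `!` as the delimiter, `https?` in place of the question's `http.?`, and a lazy `.*?` so the match stops at the first `</a>`. The sample input is purely illustrative.

```php
<?php
// Nowdoc: the text is taken literally, so \b and \1 reach the regex engine as written.
$pattern = <<<'REGEX'
!<a [^>]*\bhref=(['"])https?://((?!mydomain)[^'"])+\1.*?</a>!i
REGEX;

// Hypothetical input: one foreign link and one link to our own domain.
$string = "<a href='http://otherdomain.com/page'>text</a> and "
        . "<a href='http://mydomain.com/page'>mine</a>";

// Only the foreign link is replaced.
echo preg_replace($pattern, 'xxx', $string), "\n";
// prints: xxx and <a href='http://mydomain.com/page'>mine</a>
```

Note that the look-ahead rejects a URL that contains "mydomain" anywhere, so something like `http://mydomain.evil.example/` would also be left alone. For real use you would probably want to anchor the host name more tightly.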
Notes:
- I changed `a.*href` to `a [^>]*\bhref` to make sure that the ‘a’ and ‘href’ are whole words and that the regex doesn’t match over multiple tags.
- With a delimiter other than `/`, the slashes in the URL don’t need to be escaped any more.
- The key part is `((?!mydomain)[^'"])+`. This means “match `[^'"]+` that isn’t mydomain”. The `(?!` is called a negative look-ahead.
- Note the `\1`. This makes sure that the closing quote mark for the URL is the same as the opening quote mark (see how the first set of brackets captures the `['"]`?). You’d probably be fine without it if you preferred.

For PHP (updated because I always mix up when backslashes need to be escaped in PHP; see @GlitchMr’s comment below):
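As a hedged illustration of the escaping rules, the sketch below writes one pattern of the shape described in the notes twice: once as a single-quoted PHP string and once as a double-quoted one. The host name `mydomain`, the `!` delimiter, the variable `$patternDq` and the sample input are assumptions made for the example.

```php
<?php
// Single-quoted string: only \' and \\ are escape sequences, so \b and \1
// reach the regex engine as written; the apostrophe in each class must be \'.
$pattern = '!<a [^>]*\bhref=([\'"])https?://((?!mydomain)[^\'"])+\1.*?</a>!i';

// Double-quoted string: "\1" would be read as an octal escape (byte 0x01),
// so the backslash is doubled, and the double quote itself must be escaped.
$patternDq = "!<a [^>]*\\bhref=(['\"])https?://((?!mydomain)[^'\"])+\\1.*?</a>!i";

// Both literals produce exactly the same pattern text.
var_dump($pattern === $patternDq);

$replace = 'xxx';
$string  = '<a href="http://otherdomain.com/">text</a>';
echo preg_replace($pattern, $replace, $string), "\n";
```

Single quotes are usually the less error-prone choice for regex literals in PHP, since fewer characters have a special meaning inside them.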
See it in action here, where you can tweak it to your purposes.