I want to parse a robots.txt file and extract the sitemap reference. Assume the file looks something like this:
stuff
foobar
Sitemap: http://www.cgdomestics.co.uk/sitemap.xml
hello world
more stuff
I’m trying to use regex to extract exactly this:
http://www.cgdomestics.co.uk/sitemap.xml
So far I have this PHP code:
<?php
$robots_url = "http://www.cgdomestics.co.uk/robots.txt";
$robots_file = file_get_contents($robots_url);
$pattern = "/Sitemap: .*/";
$i = preg_match($pattern, $robots_file, $match);
echo $match[0];
?>
The output of the above is:
Sitemap: http://www.cgdomestics.co.uk/sitemap.xml
but I want it to output only:
http://www.cgdomestics.co.uk/sitemap.xml
Can I use regex to return exactly what I want or do I need to do another step to remove the “Sitemap: ” part? Or is there a better way to do this?
As you can probably tell I’m an infrequent user of PHP and regex.
Thanks.
Nigel
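Set a subpattern (a capturing group, written with parentheses) around the URL part of the regex and grab it from the matches array. $match[0] always holds the whole match, while $match[1] holds only what the first group captured, so no extra step is needed to strip the “Sitemap: ” part.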
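For example, here is a minimal sketch of that idea. The function name extract_sitemap_url and the inline sample string are made up for illustration, and the nullable return type assumes PHP 7.1 or later. The m and i flags are an extra assumption on my part, so the pattern only matches at the start of a line and ignores the case of "Sitemap".

```php
<?php
// Hypothetical helper (not from the original post): returns the first
// Sitemap URL found in robots.txt content, or null if there is none.
function extract_sitemap_url(string $robots_txt): ?string
{
    // m flag: ^ matches at the start of every line, not just the string.
    // i flag: "Sitemap" is matched case-insensitively.
    // (\S+) is the subpattern: it captures the URL without the prefix.
    $pattern = '/^[ \t]*Sitemap:[ \t]*(\S+)/mi';

    if (preg_match($pattern, $robots_txt, $match) === 1) {
        // $match[0] is the whole "Sitemap: ..." match; $match[1] is the URL only.
        return $match[1];
    }

    return null;
}

// Sample content mirroring the example in the question.
$sample = "stuff\nfoobar\nSitemap: http://www.cgdomestics.co.uk/sitemap.xml\nhello world\nmore stuff\n";

echo extract_sitemap_url($sample), "\n";
```

In your own script you would pass in the result of file_get_contents($robots_url) instead of $sample, after checking that it did not return false. If a robots.txt lists several sitemaps, preg_match_all with the same pattern should collect all of them in $matches[1].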