I’m trying to pull meta tags out of an HTML page so I can compare two pages (live and dev) and check that their SEO is the same after a site redesign/refactor. I need to check that the title, meta tags (description, Open Graph, etc.), h1s, our analytics (Omniture), and our ad tags (DoubleClick) are all the same.
My problem is getting the meta tags. get_meta_tags() (http://php.net/manual/en/function.get-meta-tags.php) only works if they have a name= attribute, and the same goes for “mariano at cricava dot com”’s solution.
I don’t want to restrict it to particular attributes. I could assume that all our meta tags have either a name=, property=, or http-equiv= attribute and change the regex accordingly, but I can’t be entirely sure: it’s a massive website and any random crap could be in the tags (which is exactly what this tool is meant to check), so I’d like to keep it as dynamic as possible.
So far I have:
$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);
However, the subpatterns overwrite each other on every repetition of the group, so this only pulls out the last attribute-name=attribute-value pair of each tag:
Array
(
    [0] => Array
        (
            [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            [1] => content
            [2] => text/html; charset=UTF-8
        )
    [1] => Array
        (
            [0] => <meta name="description" content="some description" />
            [1] => content
            [2] => some description
        )
    [2] => Array
        (
            [0] => <meta property="og:type" content="website" />
            [1] => content
            [2] => website
        )
    ...
I need all the attributes for all the meta tags. I could do this in two steps, first pulling out the contents of each tag with <meta ([^>]*)> and then running a second regular expression on the results, but that seems like it should be unnecessary given the power of regex. Is there a way to do it in a single expression?
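For reference, the two-step approach described above could look roughly like the sketch below. The function name extract_meta_attributes is made up for illustration, and the second pattern assumes attribute values are quoted and that no value contains a literal > character; unquoted or valueless attributes are simply skipped.

```php
<?php
// Hypothetical helper (not part of PHP): returns one associative array
// of attribute-name => attribute-value per <meta> tag found in $html.
function extract_meta_attributes(string $html): array
{
    $tags = [];

    // Step 1: capture the raw attribute string of every <meta ...> tag.
    // Assumption: attribute values never contain a literal ">".
    preg_match_all('#<meta\b([^>]*)>#i', $html, $metaMatches);

    foreach ($metaMatches[1] as $attributeString) {
        // Step 2: split that string into name="value" / name='value' pairs.
        // Assumption: values are quoted; anything else is ignored.
        preg_match_all(
            '#([^\s=/]+)\s*=\s*(["\'])(.*?)\2#s',
            $attributeString,
            $pairs,
            PREG_SET_ORDER
        );

        $attributes = [];
        foreach ($pairs as $pair) {
            // $pair[1] is the attribute name, $pair[3] the unquoted value.
            $attributes[strtolower($pair[1])] = $pair[3];
        }
        $tags[] = $attributes;
    }

    return $tags;
}
```

Calling extract_meta_attributes($page) on the fetched page should then give one array per tag holding every attribute, rather than only the last pair.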
Not possible with preg_*/PCRE (nor with any other regex flavor that I know of, although in Perl you could use a (?{ push @list, $^N }) hack).
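Since a single PCRE pattern won't keep every repetition of a capture, one alternative is to skip regex for this step and let PHP's DOM extension parse the tags. This swaps in a different technique from the one asked about, and it is only a sketch: the function name extract_meta_attributes_dom is made up, it assumes the DOM extension is enabled and that $html is non-empty, and it does nothing about character-encoding detection.

```php
<?php
// Hypothetical helper (not part of PHP): collects every attribute of
// every <meta> element using DOMDocument instead of a regex.
function extract_meta_attributes_dom(string $html): array
{
    $document = new DOMDocument();

    // Real-world markup is rarely valid, so collect libxml's parse
    // warnings internally instead of emitting them, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($document->getElementsByTagName('meta') as $meta) {
        $attributes = [];
        // $meta->attributes is a DOMNamedNodeMap of DOMAttr nodes.
        foreach ($meta->attributes as $attribute) {
            $attributes[$attribute->nodeName] = $attribute->nodeValue;
        }
        $tags[] = $attributes;
    }

    return $tags;
}
```

The result has the same shape as a regex-based extraction would (one array of name => value pairs per tag), so the live and dev pages can be compared with a plain array comparison.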