I’m having trouble with preg_grep. Which four patterns should I use with PHP’s preg_grep to extract all instances of the “__________” parts in the strings below?
1. <h2><a ....>_____</a></h2>
2. <cite><a href="_____" .... >...</a></cite>
3. <cite><a .... >________</a></cite>
4. <span>_________</span>
The dots denote some arbitrary characters while the underscores denote what I want.
An example string is:
</style></head>
<body><div id="adBlock"><h2><a href="https://www.google.com/adsense/support/bin/request.py?contact=afs_violation&hl=en" target="_blank">Ads by Google</a></h2>
<div class="ad"><div><a href="http://www.google.com/aclk?sa=L&ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&num=1&sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="titleLink" target="_parent">Spider-<b>Man</b> Animated Serie</a></div>
<span>See Your Favorite Spiderman
<br>
Episodes for Free. Only on Crackle.</span>
<cite><a href="http://www.google.com/aclk?sa=L&ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&num=1&sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="domainLink" target="_parent">www.Crackle.com/Spiderman</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&num=2&sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="titleLink" target="_parent">Kids <b>Batman</b> Costumes</a></div>
<span>Great Selection of <b>Batman</b> & Batgirl
<br>
Costumes For Kids. Ships Same Day!</span>
<cite><a href="http://www.google.com/aclk?sa=l&ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&num=2&sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="domainLink" target="_parent">www.CostumeExpress.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&num=3&sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&q=http://www.OfficialBatmanCostumes.com" class="titleLink" target="_parent"><b>Batman</b> Costume</a></div>
<span>Official <b>Batman</b> Costumes.
<br>
Huge Selection & Same Day Shipping!</span>
<cite><a href="http://www.google.com/aclk?sa=l&ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&num=3&sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&q=http://www.OfficialBatmanCostumes.com" class="domainLink" target="_parent">www.OfficialBatmanCostumes.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&num=4&sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="titleLink" target="_parent">Discount <b>Batman</b> Costumes</a></div>
<span>Discount adult and kids <b>batman</b>
<br>
superhero costumes.</span>
<cite><a href="http://www.google.com/aclk?sa=l&ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&num=4&sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="domainLink" target="_parent">www.discountsuperherocostumes.com</a></cite></div></div></body>
<script type="text/javascript">
var relay = "";
</script>
<script type="text/javascript" src="/uds/?file=ads&v=1&packages=searchiframe&nodependencyload=true"></script></html>
Thanks!
First of all, you should not use regex to extract data from an HTML string.
Instead, you should use a DOM parser!
Here, you could use:
- DOMDocument::loadHTML to load the HTML string
- the @ operator to silence warnings, as your HTML is not quite valid
- the DOMXPath class to do XPath queries on the document
For example, you could load your document and instantiate the DOMXPath class, and then use XPath to find the elements you are looking for.
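The loading step described above can be sketched as follows. This is only a minimal illustration: the `$html` value is a shortened, hypothetical stand-in for the page in the question, and the variable names `$dom` and `$xpath` are my own choice.

```php
<?php
// Hypothetical sample; in practice $html would hold the full page from the question.
$html = '<html><body><h2><a href="#">Ads by Google</a></h2></body></html>';

$dom = new DOMDocument();
// The @ operator suppresses the warnings loadHTML() emits for invalid markup.
@$dom->loadHTML($html);

// DOMXPath runs XPath queries against the loaded document.
$xpath = new DOMXPath($dom);
```

If you would rather not silence warnings with `@`, `libxml_use_internal_errors(true)` is another way to keep libxml from emitting them.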
For example, in the first case, you could use an XPath query that finds all <a> tags that are children of <h2> tags.
Then, for the second and third cases, you are searching for <a> tags that are children of <cite> tags; once you have found them, you want to check whether they have an href attribute.
And, finally, for the last one, you just want the <span> tags.
Not that hard, and much easier to read than regexes, isn’t it? 😉
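Putting the four cases together, one possible sketch looks like this. The HTML is a shortened, hypothetical sample with the same structure as the page in the question (the example.com URLs are placeholders), and the XPath expressions `//h2/a`, `//cite/a` and `//span` are my assumption of what fits the four patterns; adjust them if your markup differs.

```php
<?php
// Shortened, hypothetical sample with the same structure as the page in the question.
$html = <<<'HTML'
<html><body><div id="adBlock">
<h2><a href="https://example.com/support" target="_blank">Ads by Google</a></h2>
<div class="ad"><div><a href="http://example.com/click" class="titleLink">Spider-<b>Man</b> Animated Serie</a></div>
<span>See Your Favorite Spiderman
<br>
Episodes for Free. Only on Crackle.</span>
<cite><a href="http://example.com/click" class="domainLink">www.Crackle.com/Spiderman</a></cite></div>
</div></body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides warnings about invalid markup
$xpath = new DOMXPath($dom);

// Case 1: text of <a> tags that are children of <h2> tags.
$headings = [];
foreach ($xpath->query('//h2/a') as $a) {
    $headings[] = $a->nodeValue;
}

// Cases 2 and 3: <a> tags that are children of <cite> tags.
$citeHrefs = [];
$citeTexts = [];
foreach ($xpath->query('//cite/a') as $a) {
    if ($a->hasAttribute('href')) {
        $citeHrefs[] = $a->getAttribute('href'); // case 2: the href value
    }
    $citeTexts[] = $a->nodeValue;                // case 3: the link text
}

// Case 4: text of every <span> tag.
$spans = [];
foreach ($xpath->query('//span') as $span) {
    $spans[] = $span->nodeValue;
}

print_r(compact('headings', 'citeHrefs', 'citeTexts', 'spans'));
```

Note that `nodeValue` returns the text of the element and all of its descendants, so markup such as `<b>` or `<br>` inside a `<span>` is dropped from the extracted string.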