I am working with regular expressions in PHP and matching with preg_match().
I have a text that might look something like this:
$imy = '...without sophisticated apparatus<div class="caption"><div class="caption-inner">
<img src="http://dev.mysite.org/Heatmap.png" alt="" title="" class="image-thumbnail" />
Caption text</div></div>Some more text...
<img src="http://dev.mysite.org/Heatmap.png" alt="" title="" class="image-thumbnail" />blablah...';
My goal is to pick out either the “img” tag enclosed in the “div” tags (including the “div” tags) or just the “img” tag if it is not enclosed in divs. In each case I also want to capture the address in the src attribute of the “img” tag.
This is the pattern I use:
$imagepattern = '/<div class="caption-inner[^>]+>.*<img\b[^>]*\bsrc="([^">]*)"[^>]*>.*<\/div>(<\/div>)?|<img\b[^>]*\bsrc="([^">]*)"[^>]*>/Us';
and it works great for “div” enclosed images, but for the divless images I get weird results for the captured subpattern.
I iteratively call preg_match and remove the match from the subject string before resending it to preg_match. My call to preg_match looks like this:
preg_match($imagepattern,$imy,$image,PREG_OFFSET_CAPTURE)
What I get in my image array when matching against a divless image tag looks like this:
$image = [0] => Array
(
[0] => <img src="http://dev.mysite.org/Heatmap.png" alt="" title="" class="image-thumbnail" />
[1] => 1
)
[1] => Array
(
[0] =>
[1] => -1
)
[2] => Array
(
[0] =>
[1] => -1
)
[3] => Array
(
[0] => http://dev.mysite.org/Heatmap.png
[1] => 11
)
How can the $image array have the ‘2’ and ‘3’ keys? Don’t I only have one subpattern? Is this somehow because of the ‘or’ condition in the pattern?
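Your pattern has three capture groups, not one. Groups are numbered by their opening parenthesis, left to right across the whole pattern, regardless of which side of the | they are on: group 1 is the src in the div branch, group 2 is the optional (<\/div>), and group 3 is the src in the divless branch.
The whole expression matches through whichever alternative succeeds (div-enclosed image OR divless image).
For a divless image, only capture group 3 is filled. Groups 1 and 2 did not take part in the match, so with PREG_OFFSET_CAPTURE they show up as an empty string with offset -1.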
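To make the numbering concrete, here is a minimal sketch that runs the pattern from the question against a divless image and reads the src from whichever group was filled. The $divless string and the variable names $src, $resetpattern and $reset are made up for illustration. The second pattern is not from the question: it is one possible alternative that wraps the alternation in a PCRE branch reset group, (?|...), so that both branches number their src capture as group 1, and it makes the optional closing div non-capturing.

```php
<?php
// Pattern copied from the question: three capture groups in total.
$imagepattern = '/<div class="caption-inner[^>]+>.*<img\b[^>]*\bsrc="([^">]*)"[^>]*>.*<\/div>(<\/div>)?|<img\b[^>]*\bsrc="([^">]*)"[^>]*>/Us';

// Hypothetical subject holding only a divless image (note the one leading space).
$divless = ' <img src="http://dev.mysite.org/Heatmap.png" alt="" title="" class="image-thumbnail" />blablah...';

preg_match($imagepattern, $divless, $image, PREG_OFFSET_CAPTURE);

// Groups 1 and 2 belong to the div branch and come back as ['', -1] here;
// group 3 belongs to the divless branch and holds the address.
// Trailing unmatched groups are left out of the array, so ?? guards
// against a missing index when the div branch is the one that matched.
$src = ($image[1][0] ?? '') !== '' ? $image[1][0] : ($image[3][0] ?? null);
echo $src, "\n";

// Assumed alternative (not in the question): a branch reset group (?|...)
// restarts the group numbering in each alternative, so the src is always group 1.
$resetpattern = '/(?|<div class="caption-inner[^>]+>.*<img\b[^>]*\bsrc="([^">]*)"[^>]*>.*<\/div>(?:<\/div>)?|<img\b[^>]*\bsrc="([^">]*)"[^>]*>)/Us';

preg_match($resetpattern, $divless, $reset, PREG_OFFSET_CAPTURE);
echo $reset[1][0], "\n";
```

With the original pattern the calling code has to check two groups; with the branch reset variant it can always read index 1, whichever branch matched.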