Generally I’d match HTML attributes with this regex:
\w+=".*?"
but when the HTML contains PHP code it gets kind of dicey. Please consider the following tag:
<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>>
<?php echo $img; ?>
</option>
The above regex also matches the attribute selected="selected", which is only emitted conditionally from inside PHP logic. Is there a way to match attributes that are not inside PHP tags while still matching the ones whose value may contain PHP logic? If not, could I just remove the PHP code that isn’t part of an attribute value?
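To make the false positive concrete, here is a minimal Python sketch that runs the naive pattern over the opening tag above. Python is only a convenient choice here (the question does not name a regex engine), and the variable names are illustrative.

```python
import re

# The opening tag from the question, PHP included.
tag = (
    '<option value="<?php echo $img; ?>"'
    "<?php echo ($hpb[$i]['image_filename']==$img?' selected=\"selected\"':''); ?>>"
)

# The naive attribute pattern from the question.
naive = re.compile(r'\w+=".*?"')

matches = naive.findall(tag)
for m in matches:
    print(m)
```

The loop prints both value="<?php echo $img; ?>" and selected="selected", even though the second one only exists inside a PHP string literal.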
EDIT: Here’s what I have so far:
\w+="(((.(?!<\?php))*?)|((.((?=<\?php).*?(?=\?>))*)*?))*"
This basically means: match a string that starts with a space, then greedily match alphanumeric characters followed by an equals sign and a double quote, and then match either of the following two alternatives while capturing as many characters as possible:
- A sequence of characters which does not contain the string <?php
- A sequence of characters containing the pattern <\?php.*?\?>
Or, in other words, greedily match the value part of the attribute with all of its PHP code.
All of that till a closing double quote is encountered…
This will match either a PHP code segment or a complete attribute="value" sequence in which the value may contain PHP code. After each match you can find out what you caught by checking the contents of the capturing groups. If it’s a pure PHP segment you matched, all but group[0] will be empty; otherwise, group[1] will contain the attribute name and group[2] will contain the value.
The regex assumes < will appear inside an attribute value only as the beginning of a <?php tag. Of course that’s not a syntactically valid assumption, but it’s probably safe anyway. I can make the regex more precise if you need me to, but it will also be much less readable.
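The pattern this answer describes is not reproduced above, so the following Python sketch uses a reconstruction that fits the description, not the original regex. One alternative consumes a whole PHP segment; the other matches name="value", where the value is built from characters other than < and " plus any embedded PHP segments. The pattern itself, the helper name split_matches, and the choice of Python are all assumptions made for illustration.

```python
import re

# Hypothetical reconstruction, not the answerer's original pattern.
# Alternative 1: a complete PHP segment (no capturing group is set).
# Alternative 2: name="value"; group 1 is the name, group 2 the value.
# Inside a value, "<" is only accepted as the start of a PHP segment.
PATTERN = re.compile(
    r'<\?php.*?\?>'
    r'|(\w+)="((?:[^<"]|<\?php.*?\?>)*)"',
    re.DOTALL,
)


def split_matches(html):
    """Return (attributes, php_segments) found in html, in document order."""
    attributes, php_segments = [], []
    for m in PATTERN.finditer(html):
        if m.group(1) is None:
            # Pure PHP segment: every group except group 0 is empty.
            php_segments.append(m.group(0))
        else:
            attributes.append((m.group(1), m.group(2)))
    return attributes, php_segments


html = (
    '<option value="<?php echo $img; ?>"'
    "<?php echo ($hpb[$i]['image_filename']==$img?' selected=\"selected\"':''); ?>>\n"
    "<?php echo $img; ?>\n"
    "</option>"
)

attributes, php_segments = split_matches(html)
print(attributes)
print(len(php_segments))
```

On the tag from the question this reports value as the only attribute and treats the conditional selected="selected" as part of a PHP segment. One limitation of the sketch: PHP code that contains ?> inside a string literal would end the segment early.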