I am trying to figure out how to capture one statement if the other one doesn’t exist using preg_match.
Sample Text:
<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable --></h1></div>
Because pagetitle exists I want to pull it instead of the doctitle tag. Of course there are tons of other characters in between them, but I wanted to show you a small sample.
If pagetitle didn’t exist I would want to grab the contents of doctitle.
The twist is that I’m not writing the PHP code directly: I’m passing in a regex through a config file, and a script then takes it and pulls out the first group from the match.
This is what I came up with:
((?!.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->)<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->|<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->)
The issue is that, for some reason, PHP always returns the first group as group 1, even when it is empty because that alternative didn’t match.
For example in the sample text above it would return
0 -> <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
1 ->
2 -> <strong>Citing Your Sources</strong>
I can’t for the life of me figure out how to make this work. I also wrote this regex:
(?(?=.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->).*?<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->|.*?<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->)
But that didn’t work either. Thank you very much for the help.
Chris
user178551 is absolutely correct in recommending the use of a branch reset construct. There is fundamentally nothing wrong with your original regex (other than the fact that it is more than 300 characters long and is ALL ON ONE LINE! – and that it is unable to put one of two alternatives in a single capture group). A non-trivial (to put it mildly) regex like this needs to be written in free-spacing mode with indentation so you can actually read it. Here is your original regex with some reasonable whitespace added:
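A sketch of that layout, rebuilt from the one-line pattern in the question and otherwise unchanged. The `%` delimiters, the `s` and `x` modifiers, and the `$subject`, `$re_original` and `$found` names are assumptions for illustration, since the question does not show how the config script calls preg_match.

```php
<?php
// Sample text from the question.
$subject = '<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable --></h1></div>';

// Original pattern, only re-indented. Assumed: % delimiters, s (dot matches
// newline) and x (free-spacing) modifiers.
$re_original = '%
    (                                   # group 1: the whole alternation
      (?!                               # only if no pagetitle region lies ahead
        .*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->
        .*?<!--\s*?InstanceEndEditable\s*?-->
      )
      <!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?
      <title>(.*?)<\/title>\s*?         # group 2: doctitle contents
      <!--\s*?InstanceEndEditable\s*?-->
    | <!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->
      (.*?)                             # group 3: pagetitle contents
      <!--\s*?InstanceEndEditable\s*?-->
    )
    %sx';

// The literal space after "<!--" on the "|" line is kept exactly as posted.
$found = preg_match($re_original, $subject, $m);
var_dump($found); // int(0): free-spacing mode ignores that literal space
```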
Looking at this regex now, you can see where you have hard-coded one space on the line with the OR operator (i.e. |<!-- InstanceBegin...). This will cause the regex to fail to match when the 'x' modifier is applied. So replacing this space with a \s* and running it on your test data, here are the results I get (php-5.2.14):
These results are similar to the ones you posted (but for some reason your results show only 2 capture groups???). All we need to do now is to apply user178551’s branch reset suggestion, and the regex solution becomes:
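A minimal sketch of how the branch reset version might look, rebuilt from the pattern in the question rather than copied from a tested config. The `(?|...)` group makes each alternative number its capture groups from 1, so whichever side matches ends up in group 1. The `%` delimiters, the `s` modifier, and the `$subject` and `$re` names are assumptions, and branch reset needs a PCRE build that supports it.

```php
<?php
// Sample text from the question.
$subject = '<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable --></h1></div>';

$re = '%
    (?|                                 # branch reset: both captures become group 1
      (?!                               # only if no pagetitle region lies ahead
        .*?<!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
        .*?<!--\s*InstanceEndEditable\s*-->
      )
      <!--\s*InstanceBeginEditable\s*name="doctitle"\s*-->\s*
      <title>(.*?)</title>\s*           # group 1: doctitle contents
      <!--\s*InstanceEndEditable\s*-->
    |
      <!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
      (.*?)                             # group 1 again: pagetitle contents
      <!--\s*InstanceEndEditable\s*-->
    )
    %sx';

if (preg_match($re, $subject, $m)) {
    // $m holds only the whole match [0] and the single capture [1].
    echo $m[1], "\n"; // <strong>Citing Your Sources</strong>
}
```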
I’ve gone ahead and changed all the lazy \s*? to greedy (because greedy is what you want here). I also changed all the \x22 to just " (shorter and more readable IMHO). And here are the results from running with this new, branch reset regex:
Which is (if I’m not mistaken) exactly what you are looking for. (You did not provide a test case for the other alternative, so that has not yet been tested.) Other than that, your original regex was pretty close.
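For the untested alternative, one way to check it would be a quick run against a page that has no pagetitle region. The sample below is hypothetical (the question's text with the pagetitle comments removed), and `first_group()` is an illustrative helper, not part of your script. It assumes the same `%` delimiters and `s` and `x` modifiers as a guess at how the pattern is applied.

```php
<?php
// Hypothetical helper: returns capture group 1, or null when nothing matches.
function first_group($html)
{
    $re = '%
        (?|
          (?! .*?<!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
              .*?<!--\s*InstanceEndEditable\s*--> )
          <!--\s*InstanceBeginEditable\s*name="doctitle"\s*-->\s*
          <title>(.*?)</title>\s*
          <!--\s*InstanceEndEditable\s*-->
        |
          <!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
          (.*?)
          <!--\s*InstanceEndEditable\s*-->
        )
        %sx';
    return preg_match($re, $html, $m) ? $m[1] : null;
}

// Hypothetical page with a doctitle region only.
$doctitle_only = '<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1>Citing Your Sources</h1></div>';

$fallback = first_group($doctitle_only);
echo $fallback, "\n"; // BU Libraries | Research Guides | Citing Your Sources
```

With no pagetitle region ahead, the negative lookahead passes and the doctitle branch supplies group 1.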