Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.
$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'
Of course, doing this as string operations isn’t practical because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I’m completely happy without this last step. Unfortunately, PHP doesn’t have a RegExp class (or does it?).
Is there any way to achieve this? Incidentally, does any other language offer a way? Isn’t this a pretty normal scenario? Guess not. 🙁
Alternatively, is there a way to check efficiently if either of the two expressions matches, and which one matches earlier (and if they match at the same position, which match is longer)? This is what I’m doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).
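For reference, the per-pattern check described above can be sketched roughly as follows. This is the slow baseline, not a faster alternative: it runs every pattern separately and keeps the match that starts earliest, preferring the longer one on a tie. The function name preg_first_match_sketch and the shape of the returned array are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: try each pattern on its own and report which one
// matches earliest in $subject (ties are broken by the longer match).
function preg_first_match_sketch(array $patterns, string $subject, int $offset = 0): ?array
{
    $best = null;
    foreach ($patterns as $index => $pattern) {
        // PREG_OFFSET_CAPTURE makes $m[0] a pair: [matched text, byte offset].
        if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
            continue; // no match (or an error) for this pattern
        }
        [$text, $pos] = $m[0];
        $len = strlen($text);
        $isBetter = $best === null
            || $pos < $best['pos']
            || ($pos === $best['pos'] && $len > $best['len']);
        if ($isBetter) {
            $best = ['index' => $index, 'pos' => $pos, 'len' => $len, 'text' => $text];
        }
    }
    return $best; // null when no pattern matches
}

$hit = preg_first_match_sketch(['/b+/', '/ab/', '/a/'], 'xxabb');
```

Each call scans the subject once per pattern, which is why this gets expensive on long strings with many patterns.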
EDIT:
I should have been more specific – sorry. $a and $b are variables, their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can’t make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore case) while the second uses x (extended syntax). Therefore, I can’t just concatenate the two, because the second expression does not ignore case and the first doesn’t use the extended syntax (and any whitespace therein is significant!).
I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets modifiers as specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of what characters are allowed here so I used ., you may want to narrow further).
One weakness is it doesn’t handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I’ll leave that as an exercise to the reader/questioner.
Edit: I’ve rewritten this (because I’m OCD) and ended up with:
It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression, but I’ve noticed that both have some weird modifier side-effects. For instance, in both cases if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, that will match just fine).
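A minimal sketch of the (?modifiers:sub-expression) idea, not the original function, might look like the following. The name preg_combine_sketch is hypothetical. It assumes that only the modifiers i, m, s, x and U are inlined per sub-expression, that any other modifier (such as u) has to be passed as a global modifier in the second argument, and that the combined pattern is always re-delimited with slashes. Back-references are not renumbered.

```php
<?php
// Hypothetical helper: wrap each delimited pattern in a scoped option group
// and join the groups with "|" under a single pair of slash delimiters.
function preg_combine_sketch(array $patterns, string $globalModifiers = ''): string
{
    $inline = 'imsxU'; // assumption: only these are inlined per sub-expression
    $pairs  = ['(' => ')', '[' => ']', '{' => '}', '<' => '>'];
    $parts  = [];
    foreach ($patterns as $pattern) {
        $pattern = trim($pattern);
        if ($pattern === '') {
            throw new InvalidArgumentException('Empty pattern');
        }
        $open  = $pattern[0];
        $close = $pairs[$open] ?? $open; // bracket delimiters close with their partner
        $end   = strrpos($pattern, $close);
        if ($end === false || $end === 0) {
            throw new InvalidArgumentException("No closing delimiter in: $pattern");
        }
        $body = substr($pattern, 1, $end - 1);
        $mods = (string) substr($pattern, $end + 1);
        if (strspn($mods, $inline) !== strlen($mods)) {
            throw new InvalidArgumentException("Modifier cannot be inlined: $mods");
        }
        // Escape every "/" that is not already escaped, for the new delimiter.
        $body = preg_replace('~(?<!\\\\)((?:\\\\\\\\)*)/~', '$1\\\\/', $body);
        // Turn on the modifiers this pattern uses and turn off the rest.
        $on = $off = '';
        foreach (str_split($inline) as $flag) {
            if (strpos($mods, $flag) !== false) { $on .= $flag; } else { $off .= $flag; }
        }
        $flags = $on . ($off !== '' ? '-' . $off : '');
        // In extended mode a trailing "#" comment would swallow the ")".
        $tail = strpos($mods, 'x') !== false ? "\n" : '';
        $parts[] = "(?$flags:$body$tail)";
    }
    return '/' . implode('|', $parts) . '/' . $globalModifiers;
}

$c = preg_combine_sketch(['# /[a-z] #i', '/ Moo /x']);
```

With the two patterns from the question, the first alternative stays case-insensitive with significant whitespace, while the second ignores whitespace but remains case-sensitive.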