So, let’s say I want to accept strings as follows:
SomeColumn IN||<||>||= [123, 'hello', "wassup"]||123||'hello'||"yay!"
For example: MyValue IN ['value', 123] or MyInt > 123. I think you get the idea. Now, what’s bothering me is how to phrase this in a regex. I’m using PHP, and this is what I’m doing right now:
$temp = explode(';', $constraints);
$matches = array();
foreach ($temp as $condition) {
    preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $matches[]);
}
foreach ($matches as $match) {
    if ($match[2] == 'IN') {
        preg_match('/(?:([0-9]+|".+"|\'.+\'))/', substr($match[3], 1, -1), $tempm);
        print_r($tempm);
    }
}
Really appreciate any help right there, my regex’ing is horrible.
I assume your input looks similar to this:
If you use preg_match_all there is no need for explode or to build the matches yourself. Note that the resulting two-dimensional array will have its dimensions switched, but that is often desirable. Here is the code:

There will also be a $matches[4], but it does not really have a meaning and is only used inside the regular expression.

First, a few things you did wrong in your attempt:

- (.+) will consume as much as possible, and any character. So if you have something inside a string value that looks like IN 13, then your first repetition might consume everything until there and return it as the column. It also allows whitespace and = inside column names. There are two ways around this: either make the repetition “ungreedy” by appending ?, or, even better, restrict the allowed characters so you cannot go past the desired delimiter. In my regex I only allow letters, digits and underscores (\w) for column identifiers.
- [\t| ] mixes up two concepts: alternation and character classes. What this does is “match a tab, a pipe or a space”. In character classes you simply write all characters without delimiting them. Alternatively you could have written (\t| ), which would be equivalent in this case.
- [.+] I don’t know what you were trying to accomplish with this, but it matches either a literal . or a literal +. And again it might be useful to restrict the allowed characters, and to check for correct matching of quotes (to avoid 'some string").

Now for an explanation of my own regex (you can copy this into your code as well, it will work just fine; plus you have the explanation as comments in your code):
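As a hedged illustration of the points above (column names restricted to \w, a whitespace class written without a pipe, quotes that have to match), a pattern along the following lines could be used with preg_match_all. This is only a sketch under assumptions: the sample input and the exact grouping are made up for illustration and are not necessarily the pattern this answer originally showed.

```php
<?php
// Hypothetical input: conditions separated by semicolons, as in the question.
$input = "MyValue IN ['value', 123]; MyInt > 123; Name = \"yay!\"";

// Illustrative pattern (an assumption, not necessarily the original one):
//   group 1: column identifier, restricted to \w
//   group 2: operator
//   group 3: the whole right-hand side (single literal or bracketed list)
//   group 4: one literal; it mainly exists so the list branch can reuse it via (?4)
$pattern = <<<'REGEX'
/
(\w+)                 # column name: letters, digits, underscores
[\t ]+                # tabs or spaces
(IN|<|>|=|!)          # operator
[\t ]+
(                     # group 3: right-hand side
    (                 # group 4: one literal
        \d+           #   integer
      | "[^"]*"       #   double-quoted string
      | '[^']*'       #   single-quoted string
    )
  |
    \[ \s* (?4) (?: \s* , \s* (?4) )* \s* \]   # list of literals, reusing group 4
)
/x
REGEX;

preg_match_all($pattern, $input, $matches);

print_r($matches[1]); // column names, one entry per condition
print_r($matches[2]); // operators
print_r($matches[3]); // right-hand sides (literal or whole list)
```

Because group 4 is called again through (?4) inside the list branch, list elements follow the same rules as a single literal without those rules being written twice.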
This makes use of PCRE’s recursion feature, where you can reuse a subpattern (or the whole regular expression) with (?n) (where n is just the number you would also use for a backreference).

I can think of three major things that could be improved with this regex:
- Escaped quotes inside string values are not handled (for 'don\'t do this', I would only capture 'don\'). This can be solved using a negative lookbehind.
- […]

I included none of these, because I was not sure whether they apply to your problem, and I thought the regex was already complex enough to present here.
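A minimal sketch of the negative-lookbehind idea for escaped quotes, shown for single-quoted values only. It assumes that a backslash is the only escape mechanism and that a value never ends in an escaped backslash, which this simple lookbehind would misread; the sample string is made up for illustration.

```php
<?php
// The closing quote only counts if it is NOT preceded by a backslash.
// In this double-quoted PHP string, \\\\ becomes \\ in the regex,
// which in turn matches one literal backslash.
$pattern = "/'.*?(?<!\\\\)'/";

// Hypothetical input containing an escaped quote inside the value.
$subject = "Name = 'don\\'t do this' ; Other = 'x'";

preg_match($pattern, $subject, $m);
echo $m[0], "\n"; // the full quoted value, including the escaped quote
```

The lazy .*? keeps the match from running on to the closing quote of a later value, while the lookbehind makes it skip the escaped quote in the middle.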
Usually regular expressions are not powerful enough to do proper language parsing anyway. It is generally better to write your own parser.
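To give an idea of where a hand-written parser could start, here is a minimal tokenizer sketch. The function name tokenize and the token shapes are hypothetical, chosen only for this example; a real parser would then walk the token list and check that it has the form column, operator, value.

```php
<?php
// Hypothetical helper: split one condition into [type, value] tokens.
function tokenize(string $condition): array
{
    $tokens = [];
    $length = strlen($condition);
    $i = 0;
    while ($i < $length) {
        $char = $condition[$i];
        if (ctype_space($char)) {
            $i++; // skip whitespace between tokens
        } elseif ($char === '"' || $char === "'") {
            // Quoted string: scan to the matching quote, honouring backslash escapes.
            $value = '';
            $i++;
            while ($i < $length && $condition[$i] !== $char) {
                if ($condition[$i] === '\\' && $i + 1 < $length) {
                    $i++; // drop the backslash, keep the character after it
                }
                $value .= $condition[$i++];
            }
            if ($i >= $length) {
                throw new InvalidArgumentException('Unterminated string literal');
            }
            $i++; // consume the closing quote
            $tokens[] = ['string', $value];
        } elseif (ctype_digit($char)) {
            $start = $i;
            while ($i < $length && ctype_digit($condition[$i])) {
                $i++;
            }
            $tokens[] = ['number', (int) substr($condition, $start, $i - $start)];
        } elseif (ctype_alpha($char) || $char === '_') {
            // Column names and the IN keyword both end up as "word" tokens.
            $start = $i;
            while ($i < $length && (ctype_alnum($condition[$i]) || $condition[$i] === '_')) {
                $i++;
            }
            $tokens[] = ['word', substr($condition, $start, $i - $start)];
        } elseif (strpos('<>=![],', $char) !== false) {
            $tokens[] = ['symbol', $char];
            $i++;
        } else {
            throw new InvalidArgumentException("Unexpected character '$char'");
        }
    }
    return $tokens;
}

print_r(tokenize("MyValue IN ['don\\'t', 123]"));
```

Unlike the regex, this reports unterminated strings and stray characters explicitly instead of silently failing to match.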
And since you said your regex’ing is horrible… while regular expressions seem like a lot of black magic due to their uncommon syntax, they are not that hard to understand, if you take the time once to get your head around their basic concepts. I can recommend this tutorial. It really takes you all the way through!