I’m trying to convert an existing PHP regular expression so that it applies to a slightly different style of document.
Here’s the original style of the document:
**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
And here is the working PHP code. The match only succeeds if the line is surrounded by asterisks, and it stores each side of the “-” in $m[1] and $m[2], respectively:
if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
    // only for **header - subheader** $m[2] is set.
    if ( isset($m[2]) ) {
        return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
    }
    else {
        return array(TYPE_KEY, array($m[1]));
    }
}
So, for line 1: $m[1] = “FOODS” and $m[2] = “TYPE A” (after trimming);
Line 2 would be skipped; for line 3: $m[1] = “PRODUCT”, etc.
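To make the behaviour described above concrete, here is a minimal sketch that wraps the existing match in a function and runs it over the first three sample lines. The function name classifyLine() and the values given to TYPE_HEADER and TYPE_KEY are assumptions for illustration only; the regex and the branching are taken from the snippet above.

```php
<?php
// Hypothetical stand-ins for the constants used in the original snippet.
const TYPE_HEADER = 'header';
const TYPE_KEY = 'key';

// Hypothetical wrapper; the body is the code from the question.
// Returns null when the line is not a **...** header.
function classifyLine(string $line): ?array
{
    if (preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m)) {
        // $m[2] is only set for **header - subheader**.
        if (isset($m[2])) {
            return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
        }
        return array(TYPE_KEY, array($m[1]));
    }
    return null;
}

$lines = array(
    '**FOODS - TYPE A**',
    '___________________________________',
    '**PRODUCT**',
);

foreach ($lines as $line) {
    // json_encode is used only to print the result compactly.
    echo json_encode(classifyLine($line)), "\n";
}
```

With these assumed constant values, the first line yields the header pair FOODS / TYPE A, the underscore line yields null, and the third line yields the single key PRODUCT.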
The question: How would I rewrite the above regex match if the headers did not have the asterisks, but were still all-caps and at least 4 characters long? For example:
FOODS - TYPE A
___________________________________
PRODUCT
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
CODE
Sell by date going back to February 1, 2009
Thank you.
Along the lines of (don’t forget the “u” flag for Unicode regexes):
^(?:\*\*)?(?=[^*]{4,})(\p{Lu}+)(?:\s*-\s*(\p{Lu}+))?(?:\*\*)?\s*$

^                # start of line
(?:\*\*)?        # two stars, optional
(?=[^*]{4,})     # followed by at least 4 non-star characters
(\p{Lu}+)        # group 1, Unicode upper case letters
(?:              # start no capture group
  \s*-\s*        # space*, dash, space*
  (\p{Lu}+)      # group 2, Unicode upper case letters
)?               # end no capture group, make optional
(?:\*\*)?        # two stars, optional
\s*              # optional trailing spaces
$                # end of line

EDIT: Simplified, as per the comments:
^(?=[A-Z -]{4,})([A-Z ]+)(?:-([A-Z ]+))?\s*$

^                # start of line
(?=[A-Z -]{4,})  # followed by at least 4 upper case characters, spaces or dashes
([A-Z ]+)        # group 1, upper case letters or space
(?:              # start no capture group
  -              # a dash
  ([A-Z ]+)      # group 2, upper case letters or space
)?               # end no capture group, make optional
\s*              # optional trailing spaces
$                # end of line

Contents of groups 1 and 2 must be trimmed before use.
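As a rough sketch of how the simplified pattern could be dropped into the original code, the function below uses the lookahead form from the annotated breakdown ([A-Z -]{4,}) and trims both groups. The function name classifyPlainLine(), the constant values, and the extra check that rejects whitespace-only lines are assumptions added for illustration, not part of the original code.

```php
<?php
// Hypothetical stand-ins for the constants used in the question.
const TYPE_HEADER = 'header';
const TYPE_KEY = 'key';

// Hypothetical helper: classifies an all-caps header line without asterisks.
// Returns null for anything that is not a header.
function classifyPlainLine(string $line): ?array
{
    $pattern = '/^(?=[A-Z -]{4,})([A-Z ]+)(?:-([A-Z ]+))?\s*$/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    $head = trim($m[1]);
    if ($head === '') {
        // Assumption: a line made only of spaces should not count as a header.
        return null;
    }
    // $m[2] is only set for "HEADER - SUBHEADER".
    if (isset($m[2])) {
        return array(TYPE_HEADER, array($head, trim($m[2])));
    }
    return array(TYPE_KEY, array($head));
}

$lines = array(
    'FOODS - TYPE A',
    '___________________________________',
    'PRODUCT',
    '1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;',
    'CODE',
    'Sell by date going back to February 1, 2009',
);

foreach ($lines as $line) {
    // json_encode is used only to print the result compactly.
    echo json_encode(classifyPlainLine($line)), "\n";
}
```

Because the character classes only allow A-Z, spaces and a dash, the product and date lines fall through as null, while the three all-caps lines are classified as before.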