I’m looking for a good example of using Regular Expressions in PHP to “reverse engineer” a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.
So, for example, let’s assume this is the original plain-text input (taken from a USDA press release):
WASHINGTON, April 5, 2010 – North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture’s
Food Safety and Inspection Service
(FSIS) announced today.
For clarity, the fields that are variables are highlighted below:
[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 – [corp_name=]North
American Bison Co-Op, a [corp_city=]New Rockford,
[corp_state=]N.D., establishment is recalling
approximately [amount=]25,000 pounds of [product=]whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require [reason=]the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture’s
Food Safety and Inspection Service
(FSIS) announced today.
How could I efficiently extract the contents of the
- pr_city
- pr_date
- corp_name
- corp_city
- corp_state
- amount
- product
- reason
fields from my example?
Any help would be appreciated, thanks.
Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):
So, in PHP you could do
This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted from). I’ve tried to make assumptions about legal values that make some sense, but there is the very real chance that other inputs could break this. So some more test cases are probably needed.