I am using regular expressions to get each line item’s data from a receipt.
The receipts are going to look like this:
Qty Desc
1 JD *#
MARTINI *#
2 XXXXXX
3 YYYYYY
4 JD
PEPSI *#
All items have a quantity and a description, and some of them have an extra *#. Note that descriptions can contain spaces and can span more than one line, with each line able to carry its own *#. I want to capture the quantity and the description (all lines of it, if there is more than one), and I do not care at all about the extra *#. So in this example, for the first line item I would capture Quantity=1 and Description="JD MARTINI". For the fourth, Quantity=4 and Description="JD PEPSI".
My current regular expression looks like this:
((\d+)\s+(.*)(\s+\*#)?)
It is not working, and I assume it is because making the last group optional lets the greedy (.*) consume everything up to the end of the line, including the *#. If the last group weren't optional, the regular expression would do its job for the line items with the extra *#, but it wouldn't match the line items that don't have it.
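To make that behavior concrete, here is a minimal Python sketch (Python's `re` module is an assumption; the question does not name an engine). The names `greedy` and `lazy` are made up for illustration. The second pattern is one common single-line workaround, a lazy quantifier plus an end anchor, and it does not address descriptions that continue onto a second line.

```python
import re

# The pattern from the question.
greedy = re.compile(r"((\d+)\s+(.*)(\s+\*#)?)")

m = greedy.match("1 JD *#")
# The greedy (.*) in group 3 takes the marker too, so the optional
# group 4 has nothing left to match and stays unset.
print(repr(m.group(3)), m.group(4))

# Hypothetical single-line variant: lazy (.*?) plus a $ anchor forces the
# engine to hand the trailing marker to the optional group when it exists.
lazy = re.compile(r"(\d+)\s+(.*?)(?:\s+\*#)?$")

for line in ["1 JD *#", "2 XXXXXX"]:
    qty, desc = lazy.match(line).groups()
    print(qty, repr(desc))
```

Without the `$` anchor the lazy group would simply match nothing, which is why the two changes go together.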
Any ideas?
After reading your modified question, I have determined that what you wish to accomplish cannot be done with one regular expression, because a capture group always captures a contiguous piece of the input and cannot skip characters in the middle of it. You will have to combine a regex match with a regex replace. (see this question: Regular expression to skip character in capture group)
Match Regex: (\d+)\s+([A-Z\s*#]*[A-Z]+)
Replace Regex: (\*#(\s*))|(\r\n\s+)(?=\s)
The match regex will match the quantity and the item description, including any in-between line breaks or *# occurrences, leaving out the final *#. I am assuming that descriptions contain only uppercase letters, whitespace, and *# markers, and that the last character in a description is a letter.
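As a rough illustration of what the match regex captures before any cleanup, here is a small Python sketch (Python's `re` module and the names `MATCH_RE`, `receipt`, and `raw_items` are assumptions for the example). The input is the sample receipt from the question with plain \n line endings.

```python
import re

# The match regex from above: quantity, then a description made of
# uppercase letters, whitespace, '*' and '#', ending in a letter.
MATCH_RE = re.compile(r"(\d+)\s+([A-Z\s*#]*[A-Z]+)")

receipt = "Qty Desc\n1 JD *#\nMARTINI *#\n2 XXXXXX\n3 YYYYYY\n4 JD\nPEPSI *#\n"

# findall returns one (quantity, raw description) tuple per line item.
raw_items = MATCH_RE.findall(receipt)
for qty, raw_desc in raw_items:
    print(qty, repr(raw_desc))
```

The first description comes back as 'JD *#\nMARTINI': the trailing *# is gone, but the in-between marker and line break are still there, which is what the replace step is for.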
After you run the match regex, you will get back an array of matches, which you will need to iterate through and turn into objects. I wrote some handy code to do that for you. For each object, you will run the replace regex on the object’s description, which will remove the extraneous whitespace and *#.
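This is not the code mentioned above, only a minimal Python sketch of the same match-then-clean idea. `LineItem` and `parse_receipt` are hypothetical names. For the cleanup step it swaps in a simpler technique than the replace regex: it deletes every *# marker and then collapses each run of whitespace into a single space, so it does not depend on the line-ending style or on how continuation lines are indented.

```python
import re
from dataclasses import dataclass
from typing import List

MATCH_RE = re.compile(r"(\d+)\s+([A-Z\s*#]*[A-Z]+)")
MARKER_RE = re.compile(r"\*#")  # the extra marker we want to discard


@dataclass
class LineItem:
    quantity: int
    description: str


def parse_receipt(text: str) -> List[LineItem]:
    """Turn every regex match into a LineItem with a cleaned description."""
    items = []
    for qty, raw_desc in MATCH_RE.findall(text):
        without_markers = MARKER_RE.sub("", raw_desc)
        # split() with no arguments splits on any whitespace run,
        # including line breaks, so joining yields single spaces.
        description = " ".join(without_markers.split())
        items.append(LineItem(int(qty), description))
    return items


receipt = (
    "Qty Desc\r\n1 JD *#\r\nMARTINI *#\r\n"
    "2 XXXXXX\r\n3 YYYYYY\r\n4 JD\r\nPEPSI *#\r\n"
)
items = parse_receipt(receipt)
for item in items:
    print(item.quantity, item.description)
```

On the sample receipt this yields "JD MARTINI" for the first item and "JD PEPSI" for the fourth, matching what the question asks for.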