import re

patterns = {}
patterns[1] = re.compile(r"[A-Z]\d-[A-Z]\d")
patterns[2] = re.compile(r"[A-Z]\d-[A-Z]\d\d")
patterns[3] = re.compile(r"[A-Z]\d\d-[A-Z]\d\d")
patterns[4] = re.compile(r"[A-Z]\d\d-[A-Z]\d\d\d")
patterns[5] = re.compile(r"[A-Z]\d\d\d-[A-Z]\d\d\d")
patterns[6] = re.compile(r"[A-Z][A-Z]\d-[A-Z][A-Z]\d")
patterns[7] = re.compile(r"[A-Z][A-Z]\d-[A-Z][A-Z]\d\d")
patterns[8] = re.compile(r"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d")
patterns[9] = re.compile(r"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d\d")
patterns[10] = re.compile(r"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d")

def matchFound(toSearch):
    # Try the highest-numbered pattern first; return the key of the first hit.
    for items in sorted(patterns.keys(), reverse=True):
        matchObject = patterns[items].search(toSearch)
        if matchObject:
            return items
    # No pattern matched.
    return 0
Then I use the following code to look for matches:
while matchFound(toSearch) > 0:
I have 10 different regular expressions, but I feel like they could be replaced by one well-written, more elegant regular expression. Do you guys think it’s possible?
EDIT: FORGOT TWO MORE EXPRESSIONS:
patterns[11] = re.compile(r"[A-Z]\d-[A-Z]\d\d\d")
patterns[12] = re.compile(r"[A-Z][A-Z]\d-[A-Z][A-Z]\d\d\d")
EDIT2: I ended up with the following. I realize I COULD get extra results but I don’t think they’re possible in the data I’m parsing.
patterns = {}
patterns[1] = re.compile(r"[A-Z]{1,2}\d-[A-Z]{1,2}\d{1,3}")
patterns[2] = re.compile(r"[A-Z]{1,2}\d\d-[A-Z]{1,2}\d{2,3}")
patterns[3] = re.compile(r"[A-Z]{1,2}\d\d\d-[A-Z]{1,2}\d\d\d")
Josh Caswell noted that Sean Bright’s answer will match more inputs than your original group. Sorry I didn’t figure this out. (In the future it might be good to spell out your problem a little bit more.)
So your basic problem is that regular expressions can’t count: a single compact pattern has no way to compare the length of one group with the length of another. But we can still solve this in Python in a very slick way. First we make a pattern that matches any of your legal inputs, but would also match some you want to reject.
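A permissive pattern of this kind might look like the sketch below. The exact pattern string is an assumption: it allows 1 or 2 capital letters and 1 to 3 digits on each side of the hyphen, with four parenthesized groups, and the names pat_s and pat are made up for illustration.

```python
import re

# Assumed permissive pattern: 1-2 capitals and 1-3 digits on each side of
# the hyphen. The four parenthesized parts become match groups.
pat_s = r"([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})"
pat = re.compile(pat_s)

m = pat.search("route AB12-CD345 closed")
print(m.groups())  # ('AB', '12', 'CD', '345')

# Too permissive on its own: none of the question's patterns allow
# three digits before the hyphen and only one after it.
print(pat.search("A123-B1") is not None)  # True
```

The second call shows why a follow-up length check is still needed.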
Next, we define a function that uses the pattern and then examines the match object, and counts to make sure that the matched string meets the length requirements.
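A minimal sketch of such a function follows. It assumes the permissive pattern of 1 to 2 letters and 1 to 3 digits per side, and it takes the set of legal digit-length pairs from the twelve patterns listed in the question; the name valid_lengths is hypothetical.

```python
import re

pat = re.compile(r"([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})")

# (digits before hyphen, digits after hyphen) pairs read off the twelve
# patterns in the question; this set is an assumption about what is legal.
valid_lengths = {(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)}

def check_match(s):
    """Return True if the first code found in s has a legal length combination."""
    m = pat.search(s)
    try:
        a0, n0, a1, n1 = m.groups()
    except AttributeError:
        # pat.search() returned None, so there was no match at all.
        return False
    # Letter groups must be the same length on both sides.
    if len(a0) != len(a1):
        return False
    return (len(n0), len(n1)) in valid_lengths

print(check_match("A1-B12"))   # True
print(check_match("A123-B1"))  # False
```

Note that this only judges the leftmost match, so it is not guaranteed to agree with the original unanchored patterns on every input. For example, "AB1-C1" is rejected here, while the original `[A-Z]\d-[A-Z]\d` would find "B1-C1" inside it.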
Here is some explanation of the above code.
First we use a raw string to define the pattern, and then we pre-compile the pattern. We could just stuff the literal string into the call to re.compile() but I like to have a separate string. Our pattern has four distinct sections enclosed in parentheses; these will become “match groups”. There are two match groups to match the alphabet characters, and two match groups to match numbers. This one pattern will match everything you want, but won’t exclude some stuff you don’t want.
Next we declare a set that has all the valid lengths for numbers. For example, the first group of numbers can be 1 digit long and the second group can be 2 digits; this is (1,2) (a tuple value). A set is a nice way to specify all the possible combinations that we want to be legal, while still being able to check quickly whether a given pair of lengths is legal.
The function check_match() first uses the pattern to match against the string, returning a “match object” which is bound to the name m. If the search fails, m will be None. Instead of explicitly testing for None, I used a try/except block; in retrospect it might have been better to just test for None. Sorry, I didn’t mean to be confusing. But the try/except block is a pretty simple way to wrap something and make it very reliable, so I often use it for things like this.
Finally, check_match() unpacks the match groups into four variables. The two alpha groups are a0 and a1, and the two number groups are n0 and n1. Then it checks that the lengths are legal. As far as I can tell, the rule is that alpha groups need to be the same length; and then we build a tuple of number group lengths and check to see if the tuple is in our set of valid tuples.
Here’s a slightly different version of the above. Maybe you will like it better.
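One way such a variant could look is sketched here, with an explicit test for None in place of try/except. The pattern and the set of legal length pairs are the same assumptions as before, inferred from the twelve patterns in the question.

```python
import re

pat = re.compile(r"([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})")

# Assumed legal (first number length, second number length) pairs.
valid_lengths = {(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)}

def check_match(s):
    m = pat.search(s)
    if m is None:
        # No candidate code anywhere in the string.
        return False
    a0, n0, a1, n1 = m.groups()
    # Same number of letters on both sides, and a legal pair of digit counts.
    return len(a0) == len(a1) and (len(n0), len(n1)) in valid_lengths
```

The behavior is intended to match the try/except version; only the handling of a failed search differs.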
Note: It looks like the rule for valid lengths is actually simple:
If that rule works for you, you could get rid of the set and the tuple stuff. Hmm, and I’ll make the variable names a bit shorter.
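A sketch without the set might look like this. It assumes one reading of the twelve patterns in the question: the letter groups are the same length, and the first number is never longer than the second. If the real data follows a different rule, the final comparison would need to change.

```python
import re

pat = re.compile(r"([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})")

def check_match(s):
    m = pat.search(s)
    if m is None:
        return False
    a0, n0, a1, n1 = m.groups()
    # Assumed rule: equal letter counts, and the first number is no
    # longer than the second.
    return len(a0) == len(a1) and len(n0) <= len(n1)

print(check_match("AB1-CD123"))  # True
print(check_match("A12-B1"))     # False
```

Because the pattern already limits each number to 1 to 3 digits, the `<=` comparison accepts exactly the six length pairs that the set-based version would.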