I am trying to parse all dates (possibly written in different forms) out of a string. A date may be written in the form d/m -y, for example 22/11 -12, but it may also be written in the form d/m with no year specified. If I find a date in the longer form, I don’t want it to be found again in the shorter form. This is where my code fails: it finds the first date twice (once with the year and once without it).
I really have two questions: (1) What is the “right” way of doing this? It really seems that I am coming at this problem from the wrong angle. (2) If I stick with this approach, why doesn’t the line datestring.replace(match.group(0), '') remove the date so that it can’t be found again?
This is my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

dformats = (
    '(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
    '(?P<day>\d{1,2})/(?P<month>\d{1,2})',
    '(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
)

def get_dates(datestring):
    """Try to extract all dates from certain strings.

    Arguments:
    - `datestring`: A string containing dates.
    """
    global dformats
    found_dates = []
    for regex in dformats:
        matches = re.finditer(regex, datestring)
        for match in matches:
            # Is supposed to make sure the same date is not found twice
            datestring.replace(match.group(0), '')
            found_dates.append(match)
    return found_dates

if __name__ == '__main__':
    dates = get_dates('1/2 -13, 5/3 & 2012-11-22')
    for date in dates:
        print date.groups()
Two ways:
1. Use a single regular expression and use the | operator to join all your cases together:
expr = re.compile(r"expr1|expr2|expr3")
2. Find only single instances, and then pass a “start position” for the next search. Note that this complicates things, since you always want to start with the earliest match, no matter which format produced it. That is, try all three patterns, figure out which match is the earliest, do the replacement, then repeat with an incremented start position. This makes option 1 much easier regardless.
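As a minimal sketch of option 1, the three patterns from the question could be joined like this. Python requires group names to be unique within one pattern, so the numbered names (`day1`, `day3`, and so on) are an illustrative assumption, not part of the original code. The longer d/m -y form is listed before the bare d/m form so that it is tried first at each position.

```python
import re

# Group names must be unique within a single pattern, so each alternative
# gets its own numeric suffix (hypothetical names chosen for illustration).
DATE_RE = re.compile(
    r"(?P<day1>\d{1,2})/(?P<month1>\d{1,2}) -(?P<year1>\d{2})"  # d/m -yy
    r"|(?P<year2>\d{4})-(?P<month2>\d{2})-(?P<day2>\d{2})"      # yyyy-mm-dd
    r"|(?P<day3>\d{1,2})/(?P<month3>\d{1,2})"                   # d/m (no year)
)

def get_dates(datestring):
    """Return the matched text of every date, in order of appearance."""
    # finditer yields non-overlapping matches, so a date consumed by the
    # long form cannot be matched again by the short form.
    return [m.group(0) for m in DATE_RE.finditer(datestring)]

print(get_dates("1/2 -13, 5/3 & 2012-11-22"))
# ['1/2 -13', '5/3', '2012-11-22']
```

Because a single scan produces non-overlapping matches, no replacement step is needed to prevent the same date from being reported twice.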
A few additional points:
Make sure you’re using “raw strings”: prepend an ‘r’ to each of the pattern strings. Otherwise the ‘\’ characters risk being consumed as string escapes and never reaching the RE engine.
Consider using “sub” with a callback function as the “repl” argument to do the replacement, rather than finditer. The callback is passed a match object and should return the replacement string.
Matching groups in your “single” re will have the value None if their alternative was not chosen, making it easy to detect which alternative was used.
You should not say “global” unless you intend to assign to that variable inside the function.
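The points about “sub”, the callback, and None-valued groups can be sketched together as follows. The function name `extract_dates`, the helper `collect`, and the numbered group names are assumptions made for illustration. Note that `sub`, like `str.replace`, returns a new string rather than changing its argument, so the result has to be assigned to a name to have any effect.

```python
import re

DATE_RE = re.compile(
    r"(?P<day1>\d{1,2})/(?P<month1>\d{1,2}) -(?P<year1>\d{2})"  # d/m -yy
    r"|(?P<year2>\d{4})-(?P<month2>\d{2})-(?P<day2>\d{2})"      # yyyy-mm-dd
    r"|(?P<day3>\d{1,2})/(?P<month3>\d{1,2})"                   # d/m (no year)
)

def extract_dates(datestring):
    """Return (found, remainder): parsed dates and the text with dates removed."""
    found = []

    def collect(match):
        # Groups belonging to alternatives that did not match are None;
        # keep the rest and strip the numeric suffix from each name.
        parts = dict((name[:-1], value)
                     for name, value in match.groupdict().items()
                     if value is not None)
        found.append(parts)
        return ""  # replace the matched date with nothing

    # Strings are immutable: sub returns a new string, which must be kept.
    remainder = DATE_RE.sub(collect, datestring)
    return found, remainder

found, remainder = extract_dates("1/2 -13, 5/3 & 2012-11-22")
print(found)
print(repr(remainder))
```

Each entry in `found` is a dictionary holding only the fields present in that date, so a date without a year simply has no `year` key.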
Here’s some complete, working code.