I am trying to scan some documents to find dates for a classification problem. After reading around here and some other places, I have constructed the following regular expression:
import calendar
import re

months='['+'|'.join(calendar.month_abbr[1:])+'|'+'|'.join(calendar.month_name[1:])+']'
techPart='+\\.*\\s*\\d{1,2}[,]?[\\s*][1|2]\\d{3}'
dateExpr=months+techPart
I am testing it on this string
newString='Mar. 31, 2011 Dec. 31, 2010 bananas Mar. 31, 2011 too much malarky September 1, 1992 redundant Dec. 31, 2010 September 29, 1999 March 12 2004 ddfd March. 13 2019 ddfd Mac. 13 2019 ddfd'
and when I run it like this
for date in re.findall(dateExpr,newString):
    print date
I get this
Mar. 31, 2011
Dec. 31, 2010
Mar. 31, 2011
September 1, 1992
Dec. 31, 2010
September 29, 1999
March 12 2004
March. 13 2019
Mac. 13 2019 #here is my problem
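In your months regex, you are using square brackets, giving something like [Jan|Feb|Mar|...]. That is wrong. Square brackets define a character class and match any single one of the characters inside them, so this will match J or a or n or | or F, etc. That is why Mac. 13 2019 gets through: M, a and c each occur somewhere in the month names, and the + that follows lets the class repeat. Instead you want to use parentheses: (?:Jan|Feb|Mar|...).
You need the ?: because, when a pattern contains capturing groups, findall returns only the captured groups rather than the whole match, so we need to mark this group as non-capturing.
You have the same problem later in your regex where you do [1|2]. You want (?:1|2), or just [12].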