I’m trying to use Python 2.7 regexes to retrieve data from sample web pages provided in a course I’m taking. The code I’m trying to get to work is:
email_patterns = ['(?P<lname>[\w+\.]*\w+ *)@(?P<domain> *\w+[\.\w+]*).(?P<tld>com)']
for pattern in email_patterns:
    # 'line' is a line of text in a sample web page
    matches = re.findall(pattern, line)
    for m in matches:
        print 'matches=', m
        email = '{}@{}.{}'.format(m.group('lname'), m.group('domain'), m.group('tld'))
Running this returns the following error:
email = '{}@{}.{}'.format(m.group('lname'), m.group('domain'), m.group('tld'))
AttributeError: 'tuple' object has no attribute 'group'.
I want to use named groups because the order of the groups can change depending on the text I’m matching. However, it doesn’t appear to work because the interpreter doesn’t treat ‘m’ as a match object.
What’s going on here, and how can I get this to work properly by using named groups?
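The behaviour can be reproduced in isolation. What follows is only a minimal sketch: the sample `line` is made up (the course pages aren't shown), and the pattern is a simplified stand-in for the one in the question, not a recommended email regex. It is written to run under both Python 2.7 and Python 3, and it contrasts what `re.findall()` and `finditer()` hand back when the pattern contains named groups.

```python
import re

# Hypothetical sample line; the real course pages are not shown in the question.
line = 'Contact: jane.doe@cs.example.com or bob@example.com'

# Simplified pattern for illustration only (an assumption, not the asker's exact regex).
pattern = re.compile(r'(?P<lname>[\w.]+)@(?P<domain>[\w.]+)\.(?P<tld>com)')

# findall() returns plain tuples when the pattern has several groups,
# so the items have no group() method.
tuples = re.findall(pattern, line)

# finditer() yields match objects, so named groups can be looked up
# by name regardless of their order in the pattern.
emails = []
for m in pattern.finditer(line):
    emails.append('{}@{}.{}'.format(m.group('lname'),
                                    m.group('domain'),
                                    m.group('tld')))

print(tuples)
print(emails)
```

If all the named groups are wanted at once, `m.groupdict()` on each match object returns them as a dictionary keyed by group name.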
You have two problems. As Ignacio hinted, you shouldn’t be parsing (X)HTML with regex; regular expressions are not able to handle the complexity. The other problem is that you’re using findall() instead of finditer(). findall() returns the matches as a list; when the pattern contains more than one group, it returns a list of tuples. finditer(), on the other hand, returns an iterator of MatchObject instances, each of which has a group() method.
From the Python documentation for re: