This simple problem is killing me. I posted something earlier about trying to clean up a database of addresses, and somebody suggested GeoPy to check the validity of the addresses. It is a great tool that I did not know about, but before using it I need to clean up the database a little, since GeoPy will not deal with messy formatting.
The solution is to use regular expressions, which I think I have more or less worked out for most of the types of addresses I have seen in the database.
Nevertheless, I am having problems with the last regex I defined (called r4 in the code): it returns part of the opening parenthesis, which I don’t need, and I don’t know why I get extra whitespace when it prints the last group (City: London, Country: England).
Can anybody help?
import re

r1 = '\s*ForeignZip.*--\s*([\d\.]+)'
r2 = '(\w+)\W*,\W*(\w*)'
r3 = '(?<=\().*?(?=\))'
r4 = '(\w+\W\()'
Location = [' ForeignZip (xxx) -- 734.450','Washington, DC.','London (England)']

for item in Location:
    print item
    match1 = re.search(r1,item)
    match2 = re.search(r2,item)
    match3 = re.search(r3,item)
    match4 = re.search(r4,item)
    if match1:
        print 'pattern 1 found:', match1.group(1)
    elif match2:
        print 'pattern 2 found: City :' + match2.group(1) + ", State :" + match2.group(2)
    elif match3:
        print 'pattern 3 found: City: ', match4.group() + ", Country :" + match3.group(0)
    else:
        print 'no match'
This returns
ForeignZip (xxx) -- 734.450
pattern 1 found: 734.450
Washington, DC.
pattern 2 found: City :Washington, State :DC
London (England)
pattern 3 found: City:  London (, Country :England
Only a small change to your later regexes is necessary. There are probably a million ways to do this, but here is one:
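(The snippet that belonged here is missing. As a sketch only, and not necessarily the exact pattern originally posted, a single regex can capture the city and the parenthesised country as two separate groups, which makes r4 unnecessary. The sample string is taken from the question.)

```python
import re

# Group 1: a single word (the city). Group 2: whatever sits inside the parentheses.
r3 = r'(\w+)\W*\((.*)\)'

match = re.search(r3, 'London (England)')
if match:
    city = match.group(1)
    country = match.group(2)
    print('City: ' + city + ', Country: ' + country)
```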
Or to make whitespace completely insignificant:
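(The pattern that followed is missing here. As a sketch only, assuming the simpler pattern was along the lines of (\w+)\W*\((.*)\), a whitespace-tolerant version might look like this; r3_multiword and split_city_country are hypothetical names.)

```python
import re

# Assumed pattern: \w+ from the simpler version is swapped for (?:\w+\s*)*
# so that a city made of several words still lands in group 1.
r3_multiword = r'((?:\w+\s*)*)\W*\((.*)\)'

def split_city_country(text):
    """Return (city, country) for text like 'City Name (Country)', else None."""
    match = re.search(r3_multiword, text)
    if match is None:
        return None
    # Group 1 also swallows the whitespace before '(', so strip it off.
    return match.group(1).strip(), match.group(2).strip()

print(split_city_country('New York (United States)'))
```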
This is basically the previous regex, except that it replaces \w+ with (?:\w+\s*)*, which allows multiple words to be matched but doesn’t capture them. It leaves the “groups” the same, since (?:...) never saves the string it matched anywhere.
And now change the third test to:
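(The replacement branch is not shown. Assuming the revised r3 captures the city as group 1 and the country as group 2, the third test might look like the sketch below. The helper name describe is hypothetical, and the messages are returned rather than printed so the same code runs under Python 2 and 3.)

```python
import re

r1 = r'\s*ForeignZip.*--\s*([\d\.]+)'
r2 = r'(\w+)\W*,\W*(\w*)'
# Assumed form of the revised r3: group 1 is the city, group 2 the country.
r3 = r'((?:\w+\s*)*)\W*\((.*)\)'

def describe(item):
    """Return the message for the first pattern that matches item."""
    match1 = re.search(r1, item)
    match2 = re.search(r2, item)
    match3 = re.search(r3, item)
    if match1:
        return 'pattern 1 found: ' + match1.group(1)
    elif match2:
        return 'pattern 2 found: City: ' + match2.group(1) + ', State: ' + match2.group(2)
    elif match3:
        # strip() drops the trailing space that \s* lets the city group absorb
        return 'pattern 3 found: City: ' + match3.group(1).strip() + ', Country: ' + match3.group(2)
    return 'no match'

for item in [' ForeignZip (xxx) -- 734.450', 'Washington, DC.', 'London (England)']:
    print(describe(item))
```

With match3 supplying both groups, match4 is never consulted, so r4 can be dropped.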
I also removed r4 since it isn’t necessary anymore… (Also changed the ‘,’ to a ‘+’ for consistency and added a space in ‘City:’)
Also note that when dealing with regex, it is often nice to use “raw” strings (this prevents Python from mangling tokens in your string). To test the difference, try:
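(The comparison itself is missing here. The sketch below shows the kind of test meant, using \b as the example escape, which is my choice and not necessarily the one originally posted.)

```python
import re

plain = '\bcat\b'   # Python turns each \b into a backspace character (0x08)
raw = r'\bcat\b'    # backslashes survive, so the regex engine sees \b (word boundary)

print(len(plain))   # 5 characters: backspace, c, a, t, backspace
print(len(raw))     # 7 characters: \, b, c, a, t, \, b

print(re.search(plain, 'a cat sat') is not None)  # False: no backspace characters in the text
print(re.search(raw, 'a cat sat') is not None)    # True: 'cat' found between word boundaries
```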