The following python script allows me to scrape email addresses from a given file using regular expressions.
I’m trying to add phone numbers to the regular expression also. I created this regex and seems to work on 7 and 10 digit numbers:
(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})
Can this just be added to my existing regular expression? I figure I need to edit where I use re.compile but not completely sure how to do this in python. Any help would be appreciated.
# filename variables
filename = 'file.txt'
newfilename = 'result.txt'
# read the file
if os.path.exists(filename):
data = open(filename,'r')
bulkemails = data.read()
else:
print "File not found."
raise SystemExit
# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
emails += str(x)+"\n"
# function to write file
def writefile():
f = open(newfilename, 'w')
f.write(emails)
f.close()
print "File written."
EDIT
When running on http://en.wikipedia.org/wiki/Telephone_number
It produces the following output:
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
8790468
9664261
555-1212
555-9225
555-1212
869-1234
555-5555
555-1212
867-5309
867-5309
867-5309
(267) 867-5309
(212) 736-5000
243-3460
2977743
1000000
2048000
2048000
8790468
9070412
9664261
9664261
9664261
For one, I would simplify your regex:
will match the same correct numbers as before and have fewer false hits.
Second, your e-mail regex will miss a lot of valid e-mail addresses and have many false positives, too (it would match aaaa@@@@aaaa, for example). While you can never match e-mail address with 100 % reliability using regex, the following one is better, too:
(Use the case insensitive option when compiling it).
To restrict yourself to some few TLDs, you can use