The following python script allows me to scrape email addresses from a given file

Question

0

Asked: May 17, 20262026-05-17T17:08:30+00:00 2026-05-17T17:08:30+00:00

The following python script allows me to scrape email addresses from a given file

0

The following python script allows me to scrape email addresses from a given file using regular expressions.

I’m trying to add phone numbers to the regular expression also. I created this regex and seems to work on 7 and 10 digit numbers:

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

Can this just be added to my existing regular expression? I figure I need to edit where I use re.compile but not completely sure how to do this in python. Any help would be appreciated.

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
        data = open(filename,'r')
        bulkemails = data.read()
else:
        print "File not found."
        raise SystemExit

# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
        emails += str(x)+"\n"

# function to write file
def writefile():
        f = open(newfilename, 'w')
        f.write(emails)
        f.close()
        print "File written."

EDIT
When running on http://en.wikipedia.org/wiki/Telephone_number
It produces the following output:

2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
8790468
9664261
555-1212
555-9225
555-1212
869-1234
555-5555
555-1212
867-5309
867-5309
867-5309
(267) 867-5309
(212) 736-5000
243-3460
2977743
1000000
2048000
2048000
8790468
9070412
9664261
9664261
9664261

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-17T17:08:31+00:00

For one, I would simplify your regex:

(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b

will match the same correct numbers as before and have fewer false hits.

Second, your e-mail regex will miss a lot of valid e-mail addresses and have many false positives, too (it would match aaaa@@@@aaaa, for example). While you can never match e-mail address with 100 % reliability using regex, the following one is better, too:

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b

(Use the case insensitive option when compiling it).

To restrict yourself to some few TLDs, you can use

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+(?:asia|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[A-Z]{2})\b

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

The following python script allows me to scrape email addresses from a given file

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply