I’m trying to extract email addresses from some content, and I have a problem with false positives.
My regex for: example@site.com
[^\.^\w+](\w+) *?@ *?(\w+) *?(?:\.|dot) *?(\w+)
Regex for: example@sub.site.com
[^\.^\w+](\w+) *?@ *?(\w+) *?(?:\.|dot) *?(\w+) *?(?:\.|dot) *?(\w+)
I want the first regex not to match:
example@sub.site
How can I fix it?
The only way to distinguish example@site.com from example@sub.site is to maintain a list of valid top-level domains (yes, I’m sorry).
That is, replace your last (\w+) with (com|org|info|ly|...) and so on. There is no universal way.
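As a rough sketch of that substitution in Python: the TLDS list here is a tiny, made-up sample (a real one would have to cover every top-level domain you care about), and the trailing \b is my own addition so that "com" is not accepted as the start of a longer word.

```python
import re

# Illustrative, deliberately incomplete allowlist (an assumption, not a real TLD list).
TLDS = ["com", "org", "info", "ly"]
tld_group = "|".join(TLDS)

# The question's first pattern, with the final (\w+) replaced by the allowlist.
# The \b after the TLD is an added guard against matching inside a longer word.
pattern = re.compile(
    r"[^\.^\w+](\w+) *?@ *?(\w+) *?(?:\.|dot) *?(" + tld_group + r")\b"
)

text = " example@site.com and example@sub.site "
print(pattern.findall(text))  # [('example', 'site', 'com')]
```

example@sub.site is rejected only because "site" is not in the list, so the result is exactly as good as the list you maintain.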
Also, you could use a single regex for both cases.
Also, my address could be example@sub1.sub2.site.com, so be careful…
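One possible way to fold both cases into a single pattern, as a sketch only: a repeated group accepts any number of domain labels before a top-level domain taken from a list. The names EMAIL_RE, TLDS and find_emails are mine, the list is again a small made-up sample, and I swapped the leading character class for a lookbehind so that an address at the very start of the text is also found.

```python
import re

# Illustrative, deliberately incomplete allowlist (an assumption, not a real TLD list).
TLDS = ["com", "org", "info", "ly"]

EMAIL_RE = re.compile(
    r"(?<![\w.+])"                    # not preceded by a word char, '.' or '+'
    r"(\w+) *@ *"                     # local part, optional spaces around '@'
    r"((?:\w+ *(?:\.|dot) *)+)"       # one or more labels, each followed by '.' or 'dot'
    r"(" + "|".join(TLDS) + r")\b"    # top-level domain from the allowlist
)

def find_emails(text):
    """Return the raw text of every address-like match in `text`."""
    return [m.group(0) for m in EMAIL_RE.finditer(text)]

text = "Write to example@sub1.sub2.site.com or example@site.com, not example@sub.site"
print(find_emails(text))
# ['example@sub1.sub2.site.com', 'example@site.com']
```

The matches are returned as written in the text, so an obfuscated address such as "example @ site dot com" would still need normalising afterwards.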