Given a website, what is the best way, programmatically and/or with scripts, to extract all email addresses that appear in plain text in the form XXXX@YYYYY.ZZZZ on each page, starting from that link and following all pages underneath it, recursively or up to some fixed depth?
Using shell programming, you can achieve your goal with two programs piped together, wget and grep.
An example:
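A minimal sketch of such a pipeline, using only the wget and grep options named in this answer. The function names extract_emails and crawl_emails are hypothetical, and the regular expression is an assumption: a deliberately simple pattern for XXXX@YYYYY.ZZZZ, not a complete validator for every legal address.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every email-like token found on stdin, one per line.
# The pattern is an assumption (simple, not a full address grammar).
extract_emails() {
  grep -E -o '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
}

# Hypothetical helper: crawl a site up to a given depth and extract addresses.
# -q quiet, -r recursive, -l maximum depth, -O - write every page to stdout.
crawl_emails() {
  local url=$1 depth=${2:-5}
  wget -q -r -l "$depth" -O - "$url" | extract_emails
}

# Usage (needs network access; somesite.com.br is the placeholder host from the text):
#   crawl_emails http://somesite.com.br 5
```

Keeping the grep step in its own function makes it possible to check the pattern against local text before pointing wget at a real site.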
wget, in quiet mode (-q), fetches all pages recursively (-r) with a maximum depth of 5 (-l 5) from somesite.com.br and prints everything to stdout (-O -).
grep uses an extended regular expression (-E) and prints only (-o) the matching email addresses.
All email addresses are printed to standard output; you can write them to a file by appending > somefile.txt to the command.
Read the man pages for more documentation on wget and grep.
This example was tested with GNU bash version 4.2.37(1)-release, GNU grep 2.12 and GNU Wget 1.13.4.