I have extracted the HTML source from a web page and was wondering how to extract text such as email addresses from it. I'm thinking of using jsoup, like this:
    public static String html2text(String html) {
        return Jsoup.parse(html).text();
    }
but that would give me a lot of unwanted text as well.
You can strip all the tags (unless the emails are inside tags), then either apply a regular expression or check every word to see whether it matches an email pattern. I usually mark a word as an email if it contains an @ and a . is found afterwards. Note that many addresses that are valid under the standard email format will not match this check (e.g. "hello world@domain.com"). Yes, the format allows space characters before the @!
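As a rough sketch of the regular-expression approach described above: the class name EmailExtractor, the method extractEmails, and the pattern itself are assumptions made for illustration, not something the original posts specify. The pattern is deliberately simple (some characters, an @, then a domain with at least one dot) and does not try to cover everything the email standard permits, such as quoted local parts containing spaces. It uses only java.util.regex and expects text that has already had its tags stripped, for example by the html2text method from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // Simplified pattern (an assumption, not a full standard-compliant matcher):
    // local part, then '@', then a domain made of labels joined by at least one dot.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Returns every substring of the given plain text that matches the pattern.
    public static List<String> extractEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // In practice the input would be the tag-stripped page text.
        String text = "Contact alice@example.com or bob.smith@mail.example.org for details.";
        System.out.println(extractEmails(text));
    }
}
```

With this pattern, a string like "hello world@domain.com" yields only world@domain.com, which illustrates the limitation mentioned above: anything before a space is not treated as part of the address.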