I would like to extract the words from a text. I have written a simple regex.
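my $regex = qr/\W+/;    # split on runs of non-word characters
my @words;
while (<DATA>) {
    # grep drops the empty field split returns when a line starts with a separator
    push @words, grep { length } split $regex;
}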
I would like to modify it to include proper names, which may combine multiple ‘words’. For example:
@names = ('John Smith', 'Joe Smith');
I don’t think there is a definitive solution. Regular expressions are limited in a complex text space like a web page or a book, which has many anomalies (what about book titles, for example?). Look at using either 1) natural language processing, or 2) an index approach, where you identify two words that each start with a capital letter and are separated by one space, then check whether either of them is contained in an index of known first or last names. Good luck.
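For what it's worth, here is a rough sketch of how option 2 might look in Perl. The `%known_name` hash and the `extract_tokens` sub are made-up names for illustration, and the three-entry index is only a stand-in for a real list of first and last names that you would load from a file. Treat it as a starting point under those assumptions, not a complete solution.

```perl
use strict;
use warnings;

# Hypothetical index of known first and last names (an assumption for this
# sketch); a real index would be loaded from a name list.
my %known_name = map { $_ => 1 } qw(John Joe Smith);

# Return the words of $text, keeping two capitalized words separated by a
# single space together when at least one of them is in the index.
sub extract_tokens {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /(\w+)/g) {
        my $word = $1;
        # \G continues from the end of $word; /c keeps pos() if this fails.
        if ($word =~ /^[A-Z]/ && $text =~ /\G ([A-Z]\w*)/gc) {
            my $next = $1;
            if ($known_name{$word} || $known_name{$next}) {
                push @tokens, "$word $next";
                next;
            }
            # Not a known name: rewind so $next is matched again on its own.
            pos($text) = pos($text) - length($next) - 1;
        }
        push @tokens, $word;
    }
    return @tokens;
}

print join('|', extract_tokens('John Smith met Joe Smith in New York.')), "\n";
# prints: John Smith|met|Joe Smith|in|New|York
```

Because only one of the two words has to be in the index, a capitalized word that happens to precede a known name gets merged too (a sentence starting "Dear John Smith" would yield "Dear John" and "Smith"). Requiring both words to be in the index, or checking the first word only against first names, would be stricter, at the cost of missing names that are not listed.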