I’ve run into a text-processing problem. I have an article, and I’d like to find out how many “real” words it contains.
Here is what I mean by “real”. Articles usually contain punctuation marks such as dashes, commas, and periods. I’d like to count only the words, skipping standalone punctuation such as a “-” dash or a “,” comma that has spaces around it.
I tried doing this:
my @words = split ' ', $article;
print scalar @words, "\n";
But that also counts punctuation marks surrounded by spaces as words.
So I’m thinking of using this:
my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";
This keeps every token that contains at least one letter or digit. What do you think, would this be a good enough way to count the words in an article?
Does anyone know of a module on CPAN that does this?
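To make the difference between the two attempts concrete, here is a minimal sketch that runs both on a made-up sentence. The sample text is hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from any real article.
my $article = "Hello , world - this is a test ... of 3 things .";

# Plain whitespace split: standalone punctuation tokens are counted too.
my @all = split ' ', $article;

# Keep only the tokens that contain at least one ASCII letter or digit.
my @words = grep { /[a-z0-9]/i } @all;

print scalar(@all), " tokens, ", scalar(@words), " words\n";
# prints "13 tokens, 9 words"
```

Note that a token such as "well-known" or "don't" still counts as a single word here, and a token made up only of non-ASCII letters would be skipped because the character class is ASCII-only.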
Try splitting on \W, which matches any non-word character, and also drop the underscore (_), because \w treats it as a word character.
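One way to read that suggestion is sketched below: split on runs of non-word characters or underscores, then count what is left. The helper name count_words and the sample text are assumptions made for this example, not part of the original suggestion.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the words in a string by splitting on
# runs of non-word characters (\W) or underscores.
sub count_words {
    my ($text) = @_;
    # split yields an empty leading field when the text starts with a
    # separator; grep { length } discards it.
    my @words = grep { length } split /[\W_]+/, $text;
    return scalar @words;
}

# Hypothetical sample text.
my $sample = "Hello , world - this is a test ... of 3 things .";
print count_words($sample), "\n";
# prints "9"
```

The trade-off compared with the grep approach is that apostrophes and hyphens also act as separators, so "don't" is counted as two words ("don" and "t"), and so is "well-known". Which behaviour is preferable depends on what should count as a word in the article.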