Let’s say you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt
Does anyone have a good algorithm, or open-source code, to extract words from a text file? How can I get all the words while avoiding special characters, but keeping things like ‘it’s’, etc.?
I’m working in Java. Thanks
This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don’t know how to start:
The pattern [\w']+ matches one or more consecutive word characters (letters, digits, and the underscore) or apostrophes. The example string would be printed word-by-word. Have a look at the Java Pattern class documentation to read more.
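A minimal sketch of this approach, in case it helps. The class name WordExtractor, the method extractWords, and the sample string are made up for illustration; only the pattern [\w']+ comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not part of any library.
class WordExtractor {
    // One or more word characters or straight apostrophes in a row.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Returns every match of the pattern, in the order it appears in the text.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample string; prints one word per line.
        String sample = "It's a test, isn't it?";
        for (String word : extractWords(sample)) {
            System.out.println(word);
        }
    }
}
```

For a whole file you could read it line by line and pass each line to extractWords. Two caveats: by default \w only covers ASCII letters, digits, and the underscore, so accented letters would split words unless you compile the pattern with Pattern.UNICODE_CHARACTER_CLASS, and the character class only contains the straight apostrophe ('), so a typographic one (’) would need to be added to it.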