I want to be able to read a text file as a sequence of tokens. For example, say I have a text file that contains the sentence:
It’s a good restaurant,
believe me!
I want to extract the contents of this as “tokens”. For example, one token would be “It’s”, the next token would be “ ”, the one after that would be “a”, then “ ”, then “good”, then “ ”, then “restaurant”, then “,” and “\n”, then “believe”, “ ”, “me”, “!”. So I guess one way of putting it is that every token is either a word or not a word.
Here is what I have so far. I check whether a token is a word elsewhere in the program; this method just returns the next token:
public Token next() {
    if (c == -1) {
        throw new NoSuchElementException();
    }
    Writer sw = new CharArrayWriter();
    try {
        while (c != -1 && Character.isLetter(c)) {
            sw.write(c);
            c = r.read();
        }
        while (c != -1 && !Character.isLetter(c)) {
            c = r.read();
        }
    } catch (IOException e) {
        c = -1;
        return null;
    }
    return null;
}
Right now both return statements return null because I’m not sure how to turn the writer’s contents into a token. Does anyone have any tips for this? Thank you!
I think a solution using the Matcher class (java.util.regex) could solve your issue.
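Something along these lines might work as a starting point. This is only a sketch: the class name RegexTokenizer and the pattern itself are assumptions of mine, not anything from your program. It treats a run of letters and apostrophes as a word (so “It’s” stays in one piece) and every other character as its own one-character token, which matches the token list in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // Assumption: a word is a run of letters and apostrophes (straight or curly);
    // any other character (space, comma, newline, ...) is a token by itself.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}'\u2019]+|[^\\p{L}'\u2019]");

    // Returns the tokens of the given text in order of appearance.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print each token in brackets so whitespace tokens stay visible.
        for (String token : tokenize("It's a good restaurant,\nbelieve me!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

This works on a String, so you would need to read the file contents into memory first. If a run of punctuation or whitespace should be a single token instead, the second alternative would need a `+` as well.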
This regex may not be exactly the right one, but you can build a better one. See the Pattern documentation at:
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
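If you would rather keep the Reader-based approach from the question, the piece that seems to be missing is that CharArrayWriter.toString() gives you the collected characters as a String. Below is a minimal sketch under a few assumptions: the class name ReaderTokenizer is hypothetical, it returns String because your Token class is not shown, and it counts apostrophes as word characters so that “It’s” is one token. It also uses if/else instead of two consecutive loops, so a word and the characters after it do not end up in the same call.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class ReaderTokenizer implements Iterator<String> {
    private final Reader r;
    private int c; // one character of lookahead, or -1 at end of input

    public ReaderTokenizer(Reader reader) {
        this.r = reader;
        this.c = read();
    }

    // Wraps Reader.read() so callers do not have to handle IOException.
    private int read() {
        try {
            return r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption: letters and apostrophes (straight or curly) form words.
    private static boolean isWordChar(int ch) {
        return Character.isLetter(ch) || ch == '\'' || ch == '\u2019';
    }

    @Override
    public boolean hasNext() {
        return c != -1;
    }

    @Override
    public String next() {
        if (c == -1) {
            throw new NoSuchElementException();
        }
        CharArrayWriter sw = new CharArrayWriter();
        if (isWordChar(c)) {
            // Word token: consume the whole run of word characters.
            while (c != -1 && isWordChar(c)) {
                sw.write(c);
                c = read();
            }
        } else {
            // Non-word token: a single character such as " ", "," or "\n".
            sw.write(c);
            c = read();
        }
        return sw.toString();
    }
}
```

You could wrap a FileReader (ideally inside a BufferedReader) and call next() while hasNext() is true. In your own method, the last line would presumably build a Token from sw.toString() instead of returning the String directly.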