I'm trying to tokenize text files using the following code:
String fileContent = "";
String fileContentTokens[];
try {
    fileContent = new Scanner(new File(fname)).useDelimiter("\\Z").next();
} catch (Exception ex) {
    System.out.println(ex.getMessage());
}
fileContent = fileContent.replaceAll("\\s*([,.?!\"'()-:*;])\\s*", " $1 ");
//System.out.println(fileContent);
fileContentTokens = fileContent.split(" ");
The problem is that the tokens are not forming properly: some words still have quotation marks attached to them, and some still have apostrophes. The code above is supposed to put spaces around every punctuation mark so that it is not attached to the word itself. For example, "That's cool" is supposed to become " That ' s cool ". But for some reason it only does this for some of the words, not all.
Your string contains a different kind of apostrophe in the places where it fails.
In that string you have
’
But in your regex you have
'
Those are different characters. Add the former apostrophe (U+2019, the right single quotation mark) to your regex as well, and it will work.
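Something along these lines might do it. This is only a sketch, not a drop-in replacement: the class name `TokenizerSketch` and the method `tokenize` are made up for illustration, and I'm assuming you also want the curly double quotes (“ ”) and the opening single quote (‘) split off. It also moves the hyphen to the end of the character class, because in your original pattern `)-:` is read as a range that covers the digits as well, which is probably not what you meant. Splitting on `\\s+` after trimming avoids the empty tokens that `split(" ")` produces when two spaces end up next to each other.

```java
import java.util.Arrays;

class TokenizerSketch {
    // Punctuation to isolate. The straight quotes are joined by the curly ones:
    // \u2018 (‘), \u2019 (’), \u201C (“), \u201D (”).
    // The hyphen is placed last so it is a literal, not a range operator.
    private static final String PUNCT =
            "[,.?!\"'\u2018\u2019\u201C\u201D():*;-]";

    // Hypothetical helper: pads each punctuation mark with spaces, then splits.
    static String[] tokenize(String text) {
        String spaced = text.replaceAll("\\s*(" + PUNCT + ")\\s*", " $1 ");
        String trimmed = spaced.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        // Split on runs of whitespace so no empty tokens are produced.
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Input uses curly quotes, like the text that was failing.
        String[] tokens = tokenize("\u201CThat\u2019s cool\u201D");
        System.out.println(Arrays.toString(tokens));
    }
}
```

If your files contain other typographic characters (dashes, ellipses and so on), they would need to be added to the character class in the same way.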