String a ="the STRING TOKENIZER CLASS ALLOWS an APPLICATION to BREAK a STRING into TOKENS. ";
StringTokenizer st = new StringTokenizer(a);
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
Given the code above, the output is as follows:
the
STRING TOKENIZER CLASS
ALLOWS
an
APPLICATION
to
BREAK
a
STRING
into
TOKENS.
My only question is: why has "STRING TOKENIZER CLASS" been combined into one token?
When I try to run this code,
System.out.println("STRING TOKENIZER CLASS".contains(" "));
it printed a surprising result:
false
That doesn't sound logical, right? I have no idea what went wrong.
I found out the reason: somehow Java did not recognize the space as a valid space. But I don't know how it ended up like that in the earlier processing leading up to the code I've posted.
Guys, I need to point out that the code below runs before the code above:
if (!suspectedContentCollector.isEmpty()) {
    Iterator<String> i = suspectedContentCollector.iterator();
    String temp = "";
    // Join the collected content into one lower-case, space-separated string.
    while (i.hasNext()) {
        temp += i.next().toLowerCase() + " ";
    }
    StringTokenizer st = new StringTokenizer(temp);
    while (st.hasMoreTokens()) {
        temp = st.nextToken();
        temp = StopWordsRemover.remove(temp);
        // Upper-case every occurrence of the token in the analyzed sentence.
        analyzedSentence = analyzedSentence.replace(temp, temp.toUpperCase());
    }
}
Hence, once the text has been changed to uppercase, something seems to have gone wrong somewhere, and I realized that only certain spaces were not recognized. Could the cause be the way the text is retrieved from the document?
The following code:
String a = "the STRING TOKENIZER CLASS ALLOWS an APPLICATION to BREAK a STRING into TOKENS. ";
for (int i : a.toCharArray()) {
    System.out.print(i + " ");
}
produced the following output:
116 104 101 32 83 84 82 73 78 71 160 84 79 75 69 78 73 90 69 82 160 67 76 65 83 83 32 65 76 76 79 87 83 32 97 110 32 65 80 80 76 73 67 65 84 73 79 78 32 116 111 32 66 82 69 65 75 32 97 32 83 84 82 73 78 71 32 105 110 116 111 32 84 79 75 69 78 83 46 160 32
Looking at the character codes, the 'space' in question is 0xA0 (160 in the output above), which is a non-breaking space rather than the ordinary space 0x20 (32). My guess is that it was entered deliberately so that 'STRING TOKENIZER CLASS' is treated as one word.
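To make the difference concrete, here is a minimal sketch of how Java treats that character. The class name NbspCheck and the helper containsNbsp are made up for illustration; the string is written with explicit \u00A0 escapes so that the non-breaking spaces are visible in the source.

```java
class NbspCheck {
    // Hypothetical helper: true if the string contains U+00A0 (NO-BREAK SPACE).
    static boolean containsNbsp(String s) {
        return s.indexOf('\u00A0') >= 0;
    }

    public static void main(String[] args) {
        // Same phrase as in the question, with the non-breaking spaces spelled out.
        String phrase = "STRING\u00A0TOKENIZER\u00A0CLASS";

        System.out.println(phrase.contains(" "));             // false: no U+0020 in the phrase
        System.out.println(containsNbsp(phrase));             // true
        System.out.println(Character.isWhitespace('\u00A0')); // false: excluded by Java's definition
        System.out.println(Character.isSpaceChar('\u00A0'));  // true: it is a Unicode space separator
    }
}
```

This would also explain why the default StringTokenizer keeps the phrase together: its default delimiter set is " \t\n\r\f", which does not include U+00A0.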
The solution (if you do consider it correct to break 'STRING TOKENIZER CLASS' up into three words) would be to add the non-breaking space as a delimiter for the StringTokenizer class (or to the pattern passed to the String.split() method). E.g.
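One possible way to do this is sketched below. It is only an illustration under some assumptions: the class name NbspTokenizer, the method names and the DELIMITERS constant are invented here, and the test string writes the non-breaking spaces as \u00A0 escapes at the positions shown in the character dump.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class NbspTokenizer {
    // StringTokenizer's default delimiters plus U+00A0 (NO-BREAK SPACE).
    static final String DELIMITERS = " \t\n\r\f\u00A0";

    // Variant 1: StringTokenizer with an explicit delimiter set.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Variant 2: String.split() with a regex; \s alone does not match U+00A0.
    static List<String> splitOnSpaces(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[\\s\\u00A0]+")) {
            if (!part.isEmpty()) { // a leading delimiter produces one empty string
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String a = "the STRING\u00A0TOKENIZER\u00A0CLASS ALLOWS an APPLICATION "
                 + "to BREAK a STRING into TOKENS.\u00A0 ";
        for (String token : tokenize(a)) {
            System.out.println(token);
        }
        // Both variants should agree on the token list.
        System.out.println(tokenize(a).equals(splitOnSpaces(a)));
    }
}
```

With either variant, STRING, TOKENIZER and CLASS should come out as three separate tokens. If the non-breaking spaces were entered on purpose, keeping the default delimiters is the right choice instead.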