I am new to Solr. After reading the Solr wiki, I still don’t understand the difference between WhitespaceTokenizerFactory and StandardTokenizerFactory. How do they actually differ?
They differ in how they split the analyzed text into tokens.
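The StandardTokenizer does this based on the following rules (taken from the Lucene Javadoc):
- Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.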
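To make the rules above concrete, here is a rough Python sketch that approximates them with regular expressions. It is not Lucene's grammar: the function name `classic_like_tokenize` and the patterns are assumptions made for illustration, they only handle ASCII letters and digits, and the exact behavior of the real tokenizer depends on the Lucene version in use.

```python
import re

# Hypothetical patterns approximating the quoted rules (ASCII only).
EMAIL = r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
# Hyphenated run that contains at least one digit: kept whole as a "product number".
PRODUCT = r"(?=[A-Za-z0-9-]*\d)[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+"
# Dots followed by more letters/digits stay inside the token (e.g. hostnames).
DOTTED = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+"
# Plain word; surrounding punctuation is dropped.
WORD = r"[A-Za-z0-9]+"

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN = re.compile("|".join([EMAIL, PRODUCT, DOTTED, WORD]))


def classic_like_tokenize(text):
    """Return tokens according to the approximated rules above."""
    return TOKEN.findall(text)


sample = "Mail john.doe@example.com about the XL-500 wi-fi kit."
print(classic_like_tokenize(sample))
# ['Mail', 'john.doe@example.com', 'about', 'the', 'XL-500', 'wi', 'fi', 'kit']
```

Note how the email address and "XL-500" survive as single tokens, while "wi-fi" is split at the hyphen and the trailing dot after "kit" is removed.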
The WhitespaceTokenizer does this based on whitespace characters:
A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-Whitespace characters form tokens.
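For comparison, whitespace tokenization can be sketched in a couple of lines of Python. This is only an illustration of the idea, assuming Python's notion of whitespace is close enough to Lucene's for ordinary text; the helper name `whitespace_tokenize` is made up for this example.

```python
def whitespace_tokenize(text):
    """Split text at runs of whitespace; everything else stays in the tokens."""
    # str.split() with no arguments splits on any whitespace run
    # and never returns empty strings.
    return text.split()


sample = "Mail john.doe@example.com about the XL-500 wi-fi kit."
print(whitespace_tokenize(sample))
# ['Mail', 'john.doe@example.com', 'about', 'the', 'XL-500', 'wi-fi', 'kit.']
```

Punctuation is left untouched here: "wi-fi" stays one token and the final token is "kit." including the dot.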
You should pick the tokenizer that best fits your application. In either case, use the same analyzer and tokenizer for indexing and searching; otherwise the query terms may not match the tokens stored in the index.
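The following toy Python sketch shows why the index-time and query-time tokenizers should agree. It is not Solr code: the tiny inverted index, the `search` helper, and both tokenizer functions are hypothetical stand-ins, with `punctuation_tokenize` playing the role of a tokenizer that splits at punctuation.

```python
import re


def whitespace_tokenize(text):
    """Split at whitespace only."""
    return text.split()


def punctuation_tokenize(text):
    """Hypothetical stand-in for a tokenizer that also splits at punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, tokenize):
    """Return ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*[index.get(t, set()) for t in tokens])


docs = {1: "wi-fi router", 2: "ethernet router"}
index = build_index(docs, whitespace_tokenize)  # the index stores "wi-fi" as one token

print(search(index, "wi-fi", whitespace_tokenize))   # {1}
print(search(index, "wi-fi", punctuation_tokenize))  # set()
```

With matching tokenizers the query token "wi-fi" is found in the index. With the mismatched one, the query becomes "wi" and "fi", neither of which was ever indexed, so nothing matches.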