I am looking for all the delimiters on which the Java Lucene StandardAnalyzer tokenizes an input string.
I need to know every delimiter that is used by default for tokenizing.
As far as I know (from Lucene in Action), every character that is not a-zA-Z, or a variation of a-zA-Z with diacritics, is used as a delimiter, including digits.
So Mc’Donald might be split into “Mc” and “Donald”, “Web2.0” might be tokenized as just “Web”, and so on.
The best approach is to run a test with all kinds of characters and then post your results here.
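
To make the rule described above concrete, here is a minimal sketch in plain Java that splits on every character that is not a letter. It is not Lucene's tokenizer: the class name `LetterSplitSketch` and the method `tokenize` are made up for illustration, and `Character.isLetter` is only an assumed stand-in for "a-zA-Z and their accented variants". The real StandardAnalyzer may behave differently depending on the Lucene version (newer versions document StandardTokenizer as following the Unicode word-break rules of UAX #29, which can keep digits), so treat this as a baseline to compare real results against.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitSketch {

    // Hypothetical helper (not part of Lucene): collects maximal runs of
    // letters and treats every other character as a delimiter.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (Character.isLetter(cp)) {
                // Letters, including precomposed accented ones, extend the token.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any non-letter (digit, punctuation, space) ends the token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u2019 is the curly apostrophe, \u00e9 is e with an acute accent.
        String[] samples = {"Mc\u2019Donald", "Web2.0", "caf\u00e9 au lait"};
        for (String s : samples) {
            System.out.println(s + " -> " + tokenize(s));
        }
    }
}
```

Under this assumed rule, `tokenize("Web2.0")` yields only `Web` and `tokenize("Mc\u2019Donald")` yields `Mc` and `Donald`, matching the examples in the answer. Feeding the same strings to the actual analyzer and comparing the output is one way to carry out the suggested test.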