Does Lucene’s StandardTokenizer remove whitespace and blank lines? I’ve been reading the API docs for StandardTokenizer, but this isn’t specified. Maybe tokenizers do it by default; I don’t know.
Yes. Lucene tokenizers extract indexable terms from documents, and those terms do not include whitespace. They do preserve each token’s offsets in the original document, though.
This is documented in the docs for StandardTokenizer. (Whitespace is punctuation.)
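To make the "terms without whitespace, but with offsets" point concrete, here is a minimal plain-Java sketch. It is not Lucene code and does not reproduce StandardTokenizer's grammar: the class `TokenOffsetsSketch`, the `Token` holder, and the rule "a token is a run of letters or digits" are all assumptions made for illustration. It only shows the general idea that separators (spaces, tabs, blank lines) never become tokens, while each token still records where it sat in the original text. In Lucene itself you would read the term and offsets from the token stream's `CharTermAttribute` and `OffsetAttribute` instead.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified stand-in for a tokenizer, not Lucene's implementation.
class TokenOffsetsSketch {

    // Hypothetical holder for a term plus its character offsets in the input.
    static final class Token {
        final String term;
        final int start; // inclusive offset into the original text
        final int end;   // exclusive offset into the original text

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed rule: a token is a maximal run of letters or digits.
    // Whitespace, blank lines, and other characters act as separators and are dropped.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separator characters; they never become tokens
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input with a blank line, leading spaces, and a tab.
        String text = "first line\n\n   second\tline";
        for (Token t : tokenize(text)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

The offsets still point into the untouched input string, which is why highlighting can work even though the whitespace itself is never indexed.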