Does Lucene’s Standard Tokenizer remove whitespaces and blank lines? I’ve been reading the API

Asked: June 4, 2026

Does Lucene’s StandardTokenizer remove whitespace and blank lines? I’ve been reading the API documentation for StandardTokenizer, but it doesn’t say. Maybe tokenizers do this by default; I don’t know.


1 Answer

Editorial Team · Answer 1 · 2026-06-04T08:35:06+00:00

Yes. Lucene tokenizers extract indexable terms from a document’s text, and those terms do not include whitespace or blank lines. The tokenizer does record each token’s start and end offsets in the original text, though.

The documentation for StandardTokenizer covers this:

Splits words at punctuation characters, removing punctuation.

(Whitespace is handled the same way as punctuation here: it separates tokens and is then discarded.)
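To make the "terms without whitespace, but with offsets" point concrete, here is a minimal sketch. It does not call Lucene at all: it is a stand-in built on the JDK's java.text.BreakIterator, and the class name OffsetSketch and its tokenize method are hypothetical names made up for illustration. In Lucene itself, the term text and offsets are exposed through CharTermAttribute and OffsetAttribute, and the exact word-break rules may differ from what the JDK does.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in, NOT Lucene: uses the JDK's BreakIterator to mimic
// "keep terms, drop whitespace and punctuation, remember offsets".
final class OffsetSketch {

    // Returns entries such as "world[8,13)": the term plus its start (inclusive)
    // and end (exclusive) character offsets in the original text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            // Segments that begin with whitespace or punctuation are skipped,
            // so they never become terms.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                tokens.add(text.substring(start, end) + "[" + start + "," + end + ")");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Three spaces and two blank lines sit between the words.
        tokenize("Hello   world,\n\n\nsecond line.").forEach(System.out::println);
    }
}
```

The spaces and blank lines never show up as terms, yet the gaps between consecutive offsets still reveal where they were in the input. That is the same trade-off the answer describes: the separators are dropped from the term stream, but positions in the original text remain recoverable.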
