I am reading about Solr and indexing a MySQL database into Solr.
What do they mean by “tokenize” and “un-tokenize”?
And what does it mean when fields are “normalized”?
I know what it means to normalize a database and how to do it, but a field?
How can a simple field be normalized?
Thanks
Tokenizing a field enables full-text search, i.e. finding any word that occurs anywhere in the field. An untokenized field is indexed as a single value, so it will be found only on a complete and exact match, e.g. if the field’s content is “blue moon” then it will be found when you search for “blue moon”, but not when you search only for “blue”.
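To make the difference concrete, here is a minimal Python sketch of the two indexing strategies. It is only an illustration, not how Solr is implemented: the function names `index_untokenized` and `index_tokenized` are made up for this example, and the "tokenizer" is assumed to be a plain lowercase-and-split-on-whitespace step, which is far simpler than a real analyzer.

```python
def index_untokenized(docs):
    """Index each field value as one single term (the whole string)."""
    index = {}
    for doc_id, text in docs.items():
        index.setdefault(text, set()).add(doc_id)
    return index


def index_tokenized(docs):
    """Index each whitespace-separated, lowercased word as its own term."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of documents indexed under exactly this term."""
    return sorted(index.get(term, set()))


# Hypothetical sample documents: id -> field content.
docs = {1: "blue moon", 2: "full moon rising"}

untokenized = index_untokenized(docs)
tokenized = index_tokenized(docs)

print(search(untokenized, "blue moon"))  # [1]
print(search(untokenized, "blue"))       # []
print(search(tokenized, "blue"))         # [1]
print(search(tokenized, "moon"))         # [1, 2]
```

Untokenized fields are still useful for values that should only ever match as a whole, such as ids or category names.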
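This most likely refers to Unicode normalization. Unicode has separate code points for combining diacritics, e.g. U+0300 is the combining grave accent, so the accented letter è can be either a single Unicode character (U+00E8) or a sequence of two (U+0065 followed by U+0300). The two look identical but compare as different strings, and of course you want both to be found when you search for è, so the text is converted to one consistent form before it is indexed and searched.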
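The effect is easy to see with Python's standard `unicodedata` module. This is only a sketch of the idea, assuming NFC as the target form; the helper `normalize_field` is a made-up name, and Solr would do the equivalent inside its own analysis chain rather than in Python.

```python
import unicodedata

precomposed = "\u00e8"   # è as one code point
decomposed = "e\u0300"   # e (U+0065) followed by combining grave accent (U+0300)

# Both render as è, but as raw strings they are different.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2


def normalize_field(text):
    """Bring text to one canonical form (NFC, assumed here) before comparing."""
    return unicodedata.normalize("NFC", text)


# After normalization the two spellings are the same string.
print(normalize_field(precomposed) == normalize_field(decomposed))  # True
```

The same normalization has to be applied to both the indexed text and the search query, otherwise the two forms still will not match.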