I’m curious what the best method is for tokenizing/indexing terms in Lucene (or any search engine, for that matter) so that searches like these would match the corresponding terms:
“12” = “twelve”
“mx1” = “mx one”
Is there any built-in functionality I’ve overlooked?
The simplest way in Lucene is to create two separate token filters to apply after the initial string has been tokenized. The first splits each token at the boundaries between runs of digits and non-digits. The second then converts the digit strings into their spelled-out number words, so that “12” is indexed as “twelve”.
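To make the two steps concrete independently of Lucene, here is a minimal sketch in plain Python. It is not Lucene API: the names `split_digits`, `spell_number`, and `analyze` are hypothetical, and the number speller is assumed to cover only 0 to 999 (larger digit strings are passed through unchanged).

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def split_digits(token):
    """Step 1: split a token into alternating runs of digits and non-digits."""
    return re.findall(r"[0-9]+|[^0-9]+", token)

def spell_number(digits):
    """Step 2: turn a digit string into a list of number words (0-999 only)."""
    n = int(digits)
    if n > 999:
        return [digits]  # assumption: out-of-range numbers are left as-is
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
        if n == 0:
            return words
    if n < 20:
        words.append(ONES[n])
    else:
        words.append(TENS[n // 10])
        if n % 10:
            words.append(ONES[n % 10])
    return words

def analyze(text):
    """Lowercase, split on whitespace, then apply both steps to every token."""
    out = []
    for token in text.lower().split():
        for part in split_digits(token):
            if re.fullmatch(r"[0-9]+", part):
                out.extend(spell_number(part))
            else:
                out.append(part)
    return out

print(analyze("12"))      # ['twelve']
print(analyze("mx1"))     # ['mx', 'one']
print(analyze("mx one"))  # ['mx', 'one']
```

Because “mx1” and “mx one” produce the same token sequence, they match each other as long as the same analysis is applied both when indexing and when parsing the query.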
Here’s an example with PyLucene (excluding offset and position attribute logic):