I'm looking for strategies for tokenizing text for search, and some ideas on how to implement them.
Specifically, we are trying to tokenize user-generated business reviews to help with our business search engine. All the code is Python.
I think we need to do at least the following:
- Convert plural nouns to singulars
  I found a library called inflect that seems to do this well, does anyone have any experience with it?
- Get rid of all non alpha-numeric characters
  This seems like a job for regex to me, but I’d love to hear any other suggestions
- Tokenize based on whitespace, converting consecutive whitespace into a single whitespace
  I think this is doable with some custom string manipulation in Python, but there may be a better way.
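For concreteness, here is a minimal sketch of the three steps above. The names `tokenize` and `inflect_singularizer` are made up for illustration. Replacing non-alphanumeric runs with a space (rather than deleting them) is an assumption, chosen so that "food,service" does not collapse into one token. The inflect part relies on `singular_noun` returning `False` for words it does not treat as plural.

```python
import re

# Anything that is not an ASCII letter, digit, or whitespace.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z\s]+")


def tokenize(text, singularize=None):
    """Strip non-alphanumerics, split on whitespace, optionally singularize."""
    # Assumption: replace punctuation with a space instead of deleting it.
    cleaned = _NON_ALNUM.sub(" ", text)
    # str.split() with no arguments treats any run of whitespace as one separator.
    tokens = cleaned.split()
    if singularize is not None:
        tokens = [singularize(t) for t in tokens]
    return tokens


def inflect_singularizer():
    """Build a singularizer backed by the third-party inflect package."""
    import inflect  # pip install inflect

    engine = inflect.engine()

    def singular(word):
        # singular_noun() returns False when the word is not seen as a plural.
        return engine.singular_noun(word) or word

    return singular
```

With inflect installed, this would be called as `tokenize(review_text, singularize=inflect_singularizer())`. Note that this pattern also drops accented letters; `\w` would keep them but lets underscores through.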
Does anyone have any other ideas about things I’d need to do to tokenize the text? Also, what are your thoughts on the techniques and tools mentioned for implementing the strategies above?
Background info (from comments on Dough T's suggestion about Solr or Elasticsearch):
We are using ElasticSearch, and we use its tools for basic tokenization. We want to do the tokenization described above separately because, after tokenization, we are going to need to apply some pretty involved semantic analysis to extract meaning from the text. We want the flexibility to tokenize exactly how we specify, and the convenience of having the tokens stored in our own format with our own data annotations attached to them.
One thing that we absolutely need is a single (large) database record for each token, accessible and modifiable on the fly, with everything relevant about that token’s usage in it. I think that rules out just using ES tokenization to process the documents as they get indexed. We could maybe use ES’s analysis module to analyze the text without indexing it, then process each token individually to build or update that token’s database record. We’d welcome suggestions about this approach.
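To make the idea concrete, here is a rough sketch of that flow: build a request for Elasticsearch's `_analyze` endpoint (which analyzes text without indexing it), then fold the returned tokens into per-token records. The record layout (`count`, `occurrences`), the function names, and the `localhost:9200` URL are assumptions for illustration; an in-memory dict stands in for the real database.

```python
import json
from collections import defaultdict
from urllib import request


def build_analyze_request(text, analyzer="standard",
                          base_url="http://localhost:9200"):
    """Prepare (but do not send) a POST to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_record_store():
    """Hypothetical in-memory stand-in for the per-token database."""
    return defaultdict(lambda: {"count": 0, "occurrences": []})


def update_token_records(records, doc_id, analyze_response):
    """Fold one _analyze response into the per-token records."""
    for tok in analyze_response["tokens"]:
        record = records[tok["token"]]
        record["count"] += 1
        # Remember where the token appeared: (document id, token position).
        record["occurrences"].append((doc_id, tok["position"]))


# Sending the request needs a running cluster, so it is only outlined here:
#   with request.urlopen(build_analyze_request(review_text)) as resp:
#       update_token_records(records, review_id, json.load(resp))
```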
I think you want to look into a full-text search solution that provides the features you describe instead of implementing something on your own in Python. The two big open-source players in this space are Elasticsearch and Solr.
With these products you can configure fields that define custom tokenization, removal of punctuation, synonyms to aid in search, tokenization on more than just whitespace, etc. You can also easily add plugins to alter this analysis chain.
Here’s an example of Solr’s schema that has some useful stuff:
Define Field Types
Define a Field
You can then work with the search server via its REST API from Python, or just use Solr/Elasticsearch directly.
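As a rough illustration of the REST route, the sketch below builds a `match` query for Elasticsearch's `_search` endpoint using only the standard library and pulls the stored documents out of a response. The index name `reviews`, the field name `text`, and the `localhost:9200` URL are placeholders, not anything from your setup.

```python
import json
from urllib import request


def build_search_request(index, field, text,
                         base_url="http://localhost:9200"):
    """Prepare (but do not send) a full-text match query."""
    body = {"query": {"match": {field: text}}}
    return request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hit_sources(search_response):
    """Return the original documents from a _search response."""
    return [hit["_source"] for hit in search_response["hits"]["hits"]]


# Against a running server this would look like:
#   req = build_search_request("reviews", "text", "friendly staff")
#   with request.urlopen(req) as resp:
#       docs = hit_sources(json.load(resp))
```

The official Python client libraries wrap the same endpoints if you would rather not build requests by hand.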