The Archive Base


The Archive Base Latest Questions

Editorial Team
Asked: June 14, 2026


Looking for strategies on how to tokenize text for search, and some ideas on how to implement them.

Specifically, we are trying to tokenize user-generated business reviews to help with our business search engine. All the code is Python.

I think we need to do at least the following:

  • Convert plural nouns to singular
    I found a library called inflect that seems to do this well. Does anyone have any experience with it?

  • Get rid of all non-alphanumeric characters
    This seems like a job for regex to me, but I’d love to hear any other suggestions.

  • Tokenize on whitespace, treating consecutive whitespace as a single separator
    I think this is doable with some custom string manipulation in Python, but there may be a better way.
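
The three steps above can be sketched with the standard library alone. This is a minimal sketch, not a recommendation: `tokenize` and `naive_singular` are hypothetical names, and `naive_singular` is a deliberately crude stand-in for a real singularizer such as inflect.

```python
import re

# Anything that is not an ASCII letter, digit or whitespace character.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z\s]+")


def tokenize(text, singularize=None):
    """Strip non-alphanumerics, split on whitespace, optionally singularize."""
    # Replace with a space rather than deleting, so "great,but" does not
    # collapse into "greatbut".
    cleaned = _NON_ALNUM.sub(" ", text)
    # str.split() with no arguments treats any run of whitespace as one
    # separator and drops leading/trailing whitespace.
    tokens = cleaned.split()
    if singularize is not None:
        tokens = [singularize(token) for token in tokens]
    return tokens


def naive_singular(word):
    """Crude placeholder: drop a trailing "s" (assumption, not inflect)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


print(tokenize("Great  tacos,\tfriendly staff!", singularize=naive_singular))
```

To use inflect instead, note that `inflect.engine().singular_noun(word)` returns `False` when it does not treat the word as a plural, so the callable passed in would need to fall back to the original word, for example `lambda w: engine.singular_noun(w) or w`. The ASCII-only character class is also an assumption; reviews with accented characters would need a wider pattern.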

Does anyone have any other ideas about things I’d need to do to tokenize the text? Also, what are your thoughts on the techniques and tools mentioned for implementing the strategies above?

Background info (from comments on Dough T’s suggestion about Solr or Elasticsearch):
We are using Elasticsearch, and we use its tools for basic tokenization. We want to do the tokenization described above separately because, after tokenization, we are going to need to apply some pretty involved semantic analysis to extract meaning from the text. We want the flexibility to tokenize exactly how we specify, and the convenience of having the tokens stored in our own format with our own data annotations attached to them.
One thing that we absolutely need is a single (large) database record for each token, accessible and modifiable on the fly, with everything relevant about that token’s usage in it. I think that rules out just using ES tokenization to process the documents as they get indexed. We could maybe use ES’s analysis module to analyze the text without indexing it, then process each token individually in order to build or update the token’s database record. Suggestions about this approach are welcome.
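
A rough sketch of that last idea, assuming the `_analyze` REST endpoint is reachable and using a plain dict in place of the real token database. The function names, the record layout and the `"standard"` analyzer are all assumptions made for illustration.

```python
import json
from urllib import request


def analyze(es_url, text, analyzer="standard"):
    """POST text to the _analyze endpoint and return the parsed response.

    Nothing is indexed; Elasticsearch only reports the tokens it would emit.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = request.Request(
        es_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def update_token_records(records, doc_id, analyze_response):
    """Fold one _analyze response into per-token records (hypothetical layout)."""
    for tok in analyze_response.get("tokens", []):
        record = records.setdefault(tok["token"], {"count": 0, "occurrences": []})
        record["count"] += 1
        record["occurrences"].append(
            {
                "doc_id": doc_id,
                "position": tok["position"],
                "start": tok["start_offset"],
                "end": tok["end_offset"],
            }
        )
    return records
```

Usage would look something like `update_token_records(records, "review-1", analyze("http://localhost:9200", review_text))`, where the URL and document id are placeholders. In a real system the dict would be replaced by reads and writes against the token store.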


1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 11:25 am

    I think you want to look into a full-text search solution that provides the features you describe instead of implementing something of your own in Python. The two big open-source players in this space are Elasticsearch and Solr.

    With these products you can configure fields that define custom tokenization, removal of punctuation, synonyms to aid in search, tokenization on more than just whitespace, etc. You can also easily add plugins to alter this analysis chain.

    Here’s an example of a Solr schema that has some useful stuff:

    Define Field Types

    <fieldType class="solr.TextField" name="text_en" positionIncrementGap="100">
      <analyzer type="index">
        <!-- Split on whitespace only; the filters below run in order -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Map synonyms listed in index_synonyms.txt -->
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <!-- Stop-word removal, left commented out -->
        <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> -->
        <!-- Split on delimiters and case changes; also emit joined words and numbers -->
        <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
        <!-- Lowercase, then fold accented characters to ASCII -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>
    

    Define a Field

    <field indexed="true" name="text_body" stored="false" type="text_en"/>
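
Since the question mentions Elasticsearch, here is a rough sketch of a comparable analysis chain written as index settings in a Python dict. The filter name `review_word_delimiter` is made up, the synonym and stop filters are left out for brevity, and the typeless `mappings` layout assumes a recent Elasticsearch version, so treat it as a starting point rather than an exact translation.

```python
import json

# Settings for a custom analyzer: whitespace tokenizer, then word-delimiter
# splitting, lowercasing and ASCII folding, in that order.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical name for a configured word_delimiter filter.
                "review_word_delimiter": {
                    "type": "word_delimiter",
                    "generate_word_parts": True,
                    "generate_number_parts": True,
                    "catenate_words": True,
                    "catenate_numbers": True,
                    "catenate_all": False,
                    "split_on_case_change": True,
                }
            },
            "analyzer": {
                "text_en": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["review_word_delimiter", "lowercase", "asciifolding"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Same field name as the Solr example; uses the analyzer above.
            "text_body": {"type": "text", "analyzer": "text_en"}
        }
    },
}

# The JSON body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```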
    

    You can then work with the search server through its REST API from Python, or just use Solr/Elasticsearch directly.
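
As an illustration of the REST route, a minimal sketch that queries Solr's `select` handler using only the standard library. The base URL and the core name `reviews` are assumptions, and because `text_body` is declared with `stored="false"`, the returned documents would contain only whatever other stored fields the schema defines.

```python
import json
from urllib import parse, request


def build_select_url(solr_base, core, query, rows=10):
    """Build a URL for the select handler of a given core, asking for JSON."""
    params = parse.urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{solr_base.rstrip('/')}/{core}/select?{params}"


def search(solr_base, core, query, rows=10):
    """Run the query and return the list of matching documents."""
    with request.urlopen(build_select_url(solr_base, core, query, rows)) as resp:
        return json.load(resp)["response"]["docs"]


# Only builds the URL; no request is sent here.
print(build_select_url("http://localhost:8983/solr", "reviews", "text_body:taco"))
```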

