The Archive Base

Editorial Team
Asked: June 14, 2026 (2026-06-14T17:50:47+00:00)

I’m looking for an efficient data structure for string/pattern matching on a really huge set of strings. I’ve found out about tries, suffix trees and suffix arrays, but so far I couldn’t find a ready-to-use implementation in C/C++ (and implementing one myself seems difficult and error-prone). I’m also still not sure whether suffix arrays are really what I’m looking for. I’ve tried libdivsufsort and esaxx, but couldn’t work out how to use them for my needs:

I want to use a predefined set of strings with wildcards (or even regular expressions) to match a user’s input. I have a huge list of predefined strings, e.g.

“WHAT IS *?”
“WHAT IS XYZ?”
“HOW MUCH *?”
…

Now I want to find the best-matching string (if any string matches at all).
For example:
User input: >WHAT IS XYZ?
should find “WHAT IS XYZ?” rather than “WHAT IS *?”, but “WHAT IS SOMETHING?” should find “WHAT IS *?” (assuming * is a wildcard for any number of characters).

Building the structure isn’t time-critical (and the structure doesn’t have to be super space-efficient), but the search shouldn’t take too long. How can that be done easily? Any framework/library or code example is welcome.

Thanks

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 5:50 pm (2026-06-14T17:50:49+00:00)

    Here is a solution that, I believe, should work well if you have a very large number of patterns. For just 10k it may be overkill, and implementing it takes a fair amount of work, but you may be interested nevertheless.

    The basic idea is to create an inverted index that maps substrings of the patterns to pattern IDs. First, each pattern gets an ID:

    1: what is *
    2: where is *
    3: do * need to
    etc.
    

    And then we create an inverted index. In the simplest case, we split the patterns into tokens and map each token to the list of pattern IDs it occurs in. We can be flexible in what we define as a token, but one method is to assume that every white-space separated word is one token. So here is the index:

    what  -> 1
    is    -> 1,2
    where -> 2
    do    -> 3
    need  -> 3
    to    -> 3
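
A minimal C++ sketch of the index construction described above. The names `tokenize`, `build_index`, `PatternId` and `InvertedIndex` are made up for illustration, and the tokenizer makes two assumptions that are not in the original description: tokens are lowercased, and non-alphanumeric characters (including the `*` wildcard and a trailing `?`) are dropped.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using PatternId = std::size_t;  // hypothetical ID type
using InvertedIndex = std::unordered_map<std::string, std::vector<PatternId>>;

// Split on whitespace, lowercase each word and keep only letters and digits.
// Words that become empty (such as "*" or "*?") produce no token.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        std::string token;
        for (unsigned char c : word) {
            if (std::isalnum(c)) token += static_cast<char>(std::tolower(c));
        }
        if (!token.empty()) tokens.push_back(token);
    }
    return tokens;
}

// IDs are 1-based positions in `patterns`, matching the numbering above.
InvertedIndex build_index(const std::vector<std::string>& patterns) {
    InvertedIndex index;
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        for (const std::string& token : tokenize(patterns[i])) {
            std::vector<PatternId>& ids = index[token];
            // Skip the ID if this pattern already registered the same token.
            if (ids.empty() || ids.back() != i + 1) ids.push_back(i + 1);
        }
    }
    return index;
}
```

Calling `build_index` on the three example patterns should give the six entries listed above, with `is` mapped to patterns 1 and 2.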
    

    Then, when you get an input string from the user, you split that into tokens and look them up in the index. You combine all pattern IDs you get from the index. Example:

    INPUT: what is something?
    
    TOKENS:
       what      -> 1
       is        -> 1,2
       something -> n/a
    

    You retrieve the pattern IDs for each token and put them into a temporary data structure that counts the frequency of each ID, for example a hash map (e.g. a std::unordered_map<id_type,std::size_t>).

    You then sort this by frequency to find out that rule 1 was found twice and rule 2 was found once.
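
The counting and sorting step could look roughly like this. `rank_candidates` is a hypothetical helper, and the index type is an assumption (a hash map from token to a list of 1-based pattern IDs). Breaking ties by ascending ID is an extra choice made here only to keep the order deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using PatternId = std::size_t;  // hypothetical ID type
using InvertedIndex = std::unordered_map<std::string, std::vector<PatternId>>;

// Returns (pattern ID, number of shared tokens) pairs, most shared tokens first.
std::vector<std::pair<PatternId, std::size_t>> rank_candidates(
        const InvertedIndex& index, const std::vector<std::string>& input_tokens) {
    std::unordered_map<PatternId, std::size_t> hits;
    for (const std::string& token : input_tokens) {
        auto it = index.find(token);
        if (it == index.end()) continue;  // token occurs in no pattern
        for (PatternId id : it->second) ++hits[id];
    }
    std::vector<std::pair<PatternId, std::size_t>> ranked(hits.begin(), hits.end());
    std::sort(ranked.begin(), ranked.end(), [](const auto& a, const auto& b) {
        // Higher count first; equal counts are ordered by ascending ID.
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });
    return ranked;
}
```

For the tokens `what`, `is`, `something` and the example index, this yields pattern 1 with two hits followed by pattern 2 with one hit.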

    You then apply the rules you found, in the order of frequency, to the input text. Here you use a regular expression library or something similar to generate matches. The most frequent rule has the most tokens in common with the input text, so it is likely to match well.
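
As one possible way to apply the candidate rules, the sketch below turns each wildcard pattern into a `std::regex` and tries the candidates in ranked order. `wildcard_to_regex` and `first_match` are illustrative names; treating `*` as "any number of characters", matching case-insensitively and returning 0 for "no match" are assumptions. A real implementation would probably compile each regex once up front instead of on every lookup.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Translate a wildcard pattern into a regular expression: '*' becomes ".*"
// and every other character is matched literally.
std::regex wildcard_to_regex(const std::string& pattern) {
    static const std::string special = R"(\^$.|?*+()[]{})";
    std::string expr;
    for (char c : pattern) {
        if (c == '*') {
            expr += ".*";
        } else {
            // Escape characters that have a special meaning in a regex.
            if (special.find(c) != std::string::npos) expr += '\\';
            expr += c;
        }
    }
    return std::regex(expr, std::regex::ECMAScript | std::regex::icase);
}

// Try the candidates in ranked order and return the 1-based ID of the first
// pattern that matches the whole input, or 0 if none of them matches.
std::size_t first_match(const std::vector<std::string>& patterns,
                        const std::vector<std::size_t>& ranked_ids,
                        const std::string& input) {
    for (std::size_t id : ranked_ids) {
        if (std::regex_match(input, wildcard_to_regex(patterns[id - 1]))) return id;
    }
    return 0;
}
```

With the patterns from the question, the input "WHAT IS XYZ?" shares three tokens with "WHAT IS XYZ?" and only two with "WHAT IS *?", so the exact pattern is tried first and wins.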

    The overall advantage of the approach is that you need not apply all the rules to the input, but only those that have at least one token in common with it. Even among those, you try them in order of how many tokens each rule shares with the input. Once you have found a matching rule, you can probably skip the rest of the matching procedure, depending on whether you want all matching rules in each case or just one that is a very good match.

    Improvement: The above performs the rule preselection based on tokens. Instead, you could concatenate all the rules like this:

    what is *||where is *||do * need to||...
    

    Then you construct a suffix array of this concatenated string.

    Then, given an input string, you match it against the suffix array to identify all substring matches, including matches that are shorter than one token or span multiple tokens. In the example above I assume that the wildcard symbols * and $ are included in the suffix array, although of course no part of an input string will ever match them. You could equally exclude them from the suffix array or replace them with a dummy character.
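
To make the lookup concrete, here is a deliberately naive sketch: the suffix array is built by sorting all suffix start positions, and one query substring is located by binary search. `build_suffix_array` and `find_occurrences` are illustrative names. The construction is slow on large texts, so in practice a dedicated library such as libdivsufsort would take over that step, and the caller would run the search for the substrings of the input it cares about.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive construction: sort all suffix start positions lexicographically.
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Return the start positions (ascending) of every occurrence of `query`,
// found by binary search over the sorted suffixes.
std::vector<std::size_t> find_occurrences(const std::string& text,
                                          const std::vector<std::size_t>& sa,
                                          const std::string& query) {
    // First suffix whose prefix of length |query| is not less than query.
    auto lo = std::lower_bound(sa.begin(), sa.end(), query,
        [&text](std::size_t pos, const std::string& q) {
            return text.compare(pos, q.size(), q) < 0;
        });
    // First suffix whose prefix of length |query| is greater than query.
    auto hi = std::upper_bound(lo, sa.end(), query,
        [&text](const std::string& q, std::size_t pos) {
            return text.compare(pos, q.size(), q) > 0;
        });
    std::vector<std::size_t> positions(lo, hi);
    std::sort(positions.begin(), positions.end());
    return positions;
}
```

On the concatenated example string, searching for `is` finds the two occurrences inside "what is *" and "where is *".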

    Once you determine the matches, you sort them by length. You also must map the match positions in the concatenated string to rule IDs. This is readily possible by maintaining an array of starting positions of rules relative to the concatenated string; there are also highly-optimised methods based on indexed bit vectors (I can elaborate on this if necessary).
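
The array-of-starting-positions variant can be sketched as follows (the bit-vector methods are not shown). `RuleText`, `concatenate_rules` and `rule_id_at` are hypothetical names, and `||` is used as the separator as in the example above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct RuleText {
    std::string text;                 // all rules joined by the separator
    std::vector<std::size_t> starts;  // starts[i] = offset of rule i + 1
};

// Join the rules and remember where each one begins.
RuleText concatenate_rules(const std::vector<std::string>& rules,
                           const std::string& separator = "||") {
    RuleText out;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (i > 0) out.text += separator;
        out.starts.push_back(out.text.size());
        out.text += rules[i];
    }
    return out;
}

// Map an offset in the concatenated text to a 1-based rule ID.
// Offsets inside a separator are attributed to the preceding rule, so
// callers may want to reject matches that contain the separator.
std::size_t rule_id_at(const RuleText& rules, std::size_t position) {
    auto it = std::upper_bound(rules.starts.begin(), rules.starts.end(), position);
    return static_cast<std::size_t>(it - rules.starts.begin());
}
```

Because the start offsets are already sorted, each lookup is a binary search, so mapping a match position to its rule stays cheap even with very many rules.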

    Once you have the matching rule IDs, you do the same as in the inverted index case: Apply the matching rules, using standard regex matching (or similar).

    Again, this approach is relatively complicated and makes sense only when you have a very large number of rules, and when there is a good chance that a token-based (or substring-based) lookup reduces the number of candidate rules significantly. From the example rules you gave I assume the latter is the case, but I am not sure whether the number of rules you are dealing with (on the order of 10k) justifies this approach. It may make more sense if the total number of rules is in the 100ks or millions.

© 2021 The Archive Base. All Rights Reserved