Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 3425792
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 18, 20262026-05-18T06:38:11+00:00 2026-05-18T06:38:11+00:00

I am looking for a way to do a fuzzy match using regular expressions.

  • 0

I am looking for a way to do a fuzzy match using regular expressions. I’d like to use Perl, but if someone can recommend any way to do this that would be helpful.

As an example, I want to match a string on the words “New York” preceded by a 2-digit number. The difficulty comes because the text is from OCR of a PDF, so I want to do a fuzzy match. I’d like to match:

12 New York
24 Hew York
33 New Yobk

and other “close” matches (in the sense of the Levenshtein distance), but not:

aa New York
11 Detroit

Obviously, I will need to specify the allowable distance (“fuzziness”) for the match.

As I understand it, I cannot use the String::Approx Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).

Also, I should note that this is a very simplified example of what I’m really trying to match, so I’m not looking for a brute-force approach.

Edited to add:

Okay, my first example was too simple. I didn’t mean for people to get hung up on the preceding digits — sorry about the bad example. Here’s a better example. Consider this string:

ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.

What this actually says is:

ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE

What I need to do is extract the phrase “ALUSCHALME&S MANOTAC/rURINGCOMPANY” and “DELAY/ABE”. (I realize this might seem like madness. But I’m an optimist.) In general, the pattern will look something like this:

/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i

where the matching is fuzzy.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-18T06:38:11+00:00Added an answer on May 18, 2026 at 6:38 am

    If you have one pattern you want to find the best match against a text collection you can try q-gram distance. It is quite easy to implement and adopt to special needs.

    Your second description actually was helpful here, because the pattern and texts should be fairly long. q-gram distance does not work well with words like “York”, but if your typical pattern is a whole address, that should be fine.

    Try it like this:

    • transform your texts and patterns into a reduced character set, like uppercase-only, stripping, wordifying (one space between words) all symbols replaced by “#” or something.
    • choose a q-gram length, to work with. Try 3 or 2. We call this q=3.
    • than, build a qgram-profile of each text:
    • split each text into q-words, ie. NEW_YORK becomes [NEW, EW_, W_Y, _YO, ORK], store this away with each text.
    • if you search for your pattern then, you do the same with your pattern,
    • loop through your text-qgram-database and
      • count for each pattern/text-pair how many qgrams are the same.
      • each hit will raise the score by 1.
    • the texts with the highest score(s) are your best hits.

    If you did that you can tweak this algorithm by:

    • prepend all you texts (and also the pattern before search), with q-1 special chars, so even your short words will get a decent profile. For example New York becomes ^^NEW YORK$$.
    • You can even play around with replacing all consonants with “x” and vowels with “o” and so on. Play around with a couple of character classes this way, or even create super symbols by replacing groups of character by one other, i.e. CK becomes K, or SCH becomes $.
    • when raising the score by a qgram-hit you can adjust the value of 1 by other things, like length-difference of text vs pattern.
    • store 2-grams and 3-grams both, and when counting, weigh then differently.

    Note that this algorithm in the here described basic form does not have a good running time during search, i.e. O(|T|*|P|) (with |T| and |P| the total lengths of your text and pattern). This is because I described that you loop over all your texts, and then over your pattern. Therefore this is only practical for a medium-sized texts-base. If you spend some thought, you can create an advanced index structure over the q-grams (maybe using hashtables), so this might be practical for huge texts-bases as well.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am playing with TFS 2010, and am trying to setup a build process

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.