The Archive Base

Editorial Team
Asked: June 12, 2026

I have a script I found on here that works well when looking for the Longest Common Substring.

However, I need it to tolerate some incorrect or missing characters. I would like to be able to either input a required percentage of similarity, or specify the number of missing/wrong characters allowed.

For example, I want to find this string:

big yellow school bus

inside of this string:

they rode the bigyellow schook bus that afternoon

This is the code I’m currently using:

function longest_common_substring($words) {
    $words = array_map('strtolower', array_map('trim', $words));
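    // Sort by length, then alphabetically. A closure replaces create_function(),
    // which was deprecated in PHP 7.2 and removed in PHP 8.0.
    $sort_by_strlen = function ($a, $b) {
        if (strlen($a) == strlen($b)) { return strcmp($a, $b); }
        return (strlen($a) < strlen($b)) ? -1 : 1;
    };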
    usort($words, $sort_by_strlen);

    // We have to assume that each string has something in common with the first
    // string (post sort), we just need to figure out what the longest common
    // string is. If any string DOES NOT have something in common with the first
    // string, return false.
    $longest_common_substring = array();
    $shortest_string = str_split(array_shift($words));

    while (sizeof($shortest_string)) {
        array_unshift($longest_common_substring, '');
        foreach ($shortest_string as $ci => $char) {
            foreach ($words as $wi => $word) {
                if (!strstr($word, $longest_common_substring[0] . $char)) {
                    // No match
                    break 2;
                }
            }
            // we found the current char in each word, so add it to the first longest_common_substring element,
            // then start checking again using the next char as well
            $longest_common_substring[0].= $char;
        }
        // We've finished looping through the entire shortest_string.
        // Remove the first char and start all over. Do this until there are no more
        // chars to search on.
        array_shift($shortest_string);
    }

    // If we made it here then we've run through everything
    usort($longest_common_substring, $sort_by_strlen);

    return array_pop($longest_common_substring);
}

Any help is much appreciated.

UPDATE

The PHP levenshtein function is limited to 255 characters, and some of the haystacks I’m searching are 1000+ characters.

1 Answer

    Editorial Team
    Added an answer on June 12, 2026 at 5:30 pm

    Writing this as a second answer because it’s not based on my previous (bad) one at all.

    This code is based on http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm and http://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms

    It returns one (of potentially several) substrings of $haystack with the minimum Levenshtein distance to $needle. Levenshtein distance is only one measure of edit distance, and it may not suit your needs: by this metric, ‘hte’ is closer to ‘he’ (one deletion) than it is to ‘the’ (two substitutions). Some of the examples I put in show the limitations of this technique. I believe this to be considerably more reliable than the previous answer I gave, but let me know how it works for you.
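
To make the ‘hte’ remark concrete, here is a small check using PHP's built-in levenshtein() function. It is only an illustration of the metric, not part of the search code; the variable names are made up for the example.

```php
<?php
// Plain Levenshtein distance has no "swap" operation, so two transposed
// letters cost two edits, while a dropped letter costs only one.
$typo = 'hte';
foreach (array('he', 'the') as $candidate) {
    printf("levenshtein('%s', '%s') = %d\n", $typo, $candidate, levenshtein($typo, $candidate));
}
// prints:
// levenshtein('hte', 'he') = 1
// levenshtein('hte', 'the') = 2
```

If transpositions are common in your data, a metric that counts a swap as a single edit (for example Damerau-Levenshtein) may rank candidates more intuitively.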

    // utility function - returns the key of the array minimum
    function array_min_key($arr)
    {
        $min_key = null;
        $min = PHP_INT_MAX;
        foreach($arr as $k => $v) {
            if ($v < $min) {
                $min = $v;
                $min_key = $k;
            }
        }
        return $min_key;
    }
    
    // Calculate the edit distance between two strings
    function edit_distance($string1, $string2)
    {
        $m = strlen($string1);
        $n = strlen($string2);
        $d = array();
    
        // the distance from '' to substr(string,$i)
        for($i=0;$i<=$m;$i++) $d[$i][0] = $i;
        for($i=0;$i<=$n;$i++) $d[0][$i] = $i;
    
        // fill-in the edit distance matrix
        for($j=1; $j<=$n; $j++)
        {
            for($i=1; $i<=$m; $i++)
            {
                // Using, for example, the levenshtein distance as edit distance
                list($p_i,$p_j,$cost) = levenshtein_weighting($i,$j,$d,$string1,$string2);
                $d[$i][$j] = $d[$p_i][$p_j]+$cost;
            }
        }
    
        return $d[$m][$n];
    }
    
    // Helper function for edit_distance()
    function levenshtein_weighting($i,$j,$d,$string1,$string2)
    {
        // if the two letters are equal, cost is 0
        if($string1[$i-1] === $string2[$j-1]) {
            return array($i-1,$j-1,0);
        }
    
        // cost we assign each operation
        $cost['delete'] = 1;
        $cost['insert'] = 1;
        $cost['substitute'] = 1;
    
        // cost of operation + cost to get to the substring we perform it on
        $total_cost['delete'] = $d[$i-1][$j] + $cost['delete'];
        $total_cost['insert'] = $d[$i][$j-1] + $cost['insert'];
        $total_cost['substitute'] = $d[$i-1][$j-1] + $cost['substitute'];
    
        // return the parent array keys of $d and the operation's cost
        $min_key = array_min_key($total_cost);
        if ($min_key == 'delete') {
            return array($i-1,$j,$cost['delete']);
        } elseif($min_key == 'insert') {
            return array($i,$j-1,$cost['insert']);
        } else {
            return array($i-1,$j-1,$cost['substitute']);
        }
    }
    
    // attempt to find the substring of $haystack most closely matching $needle
    function shortest_edit_substring($needle, $haystack)
    {
        // initialize edit distance matrix
        $m = strlen($needle);
        $n = strlen($haystack);
        $d = array();
        for($i=0;$i<=$m;$i++) {
            $d[$i][0] = $i;
            $backtrace[$i][0] = null;
        }
        // instead of strlen, we initialize the top row to all 0's
        for($i=0;$i<=$n;$i++) {
            $d[0][$i] = 0;
            $backtrace[0][$i] = null;
        }
    
        // same as the edit_distance calculation, but keep track of how we got there
        for($j=1; $j<=$n; $j++)
        {
            for($i=1; $i<=$m; $i++)
            {
                list($p_i,$p_j,$cost) = levenshtein_weighting($i,$j,$d,$needle,$haystack);
                $d[$i][$j] = $d[$p_i][$p_j]+$cost;
                $backtrace[$i][$j] = array($p_i,$p_j);
            }
        }
    
        // now find the minimum at the bottom row
        $min_key = array_min_key($d[$m]);
        $current = array($m,$min_key);
        $parent = $backtrace[$m][$min_key];
    
        // trace up path to the top row
        while(! is_null($parent)) {
            $current = $parent;
            $parent = $backtrace[$current[0]][$current[1]];
        }
    
        // and take a substring based on those results
        $start = $current[1];
        $end = $min_key;
        return substr($haystack,$start,$end-$start);
    }
    
    // some testing
    $data = array( array('foo',' foo'), array('fat','far'), array('dat burn','rugburn'));
    $data[] = array('big yellow school bus','they rode the bigyellow schook bus that afternoon');
    $data[] = array('bus','they rode the bigyellow schook bus that afternoon');
    $data[] = array('big','they rode the bigyellow schook bus that afternoon');
    $data[] = array('nook','they rode the bigyellow schook bus that afternoon');
    $data[] = array('they','console, controller and games are all in very good condition, only played occasionally. includes power cable, controller charge cable and audio cable. smoke free house. pes 2011 super street fighter');
    $data[] = array('controker','console, controller and games are all in very good condition, only played occasionally. includes power cable, controller charge cable and audio cable. smoke free house. pes 2011 super street fighter');
    
    foreach($data as $dat) {
        $substring = shortest_edit_substring($dat[0],$dat[1]);
        $dist = edit_distance($dat[0],$substring);
        printf("Found |%s| in |%s|, matching |%s| with edit distance %d\n",$substring,$dat[1],$dat[0],$dist);
    }
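
The question asked for a way to specify either a number of allowed errors or a percentage of similarity. A minimal sketch of that, assuming unit costs for insert, delete and substitute as above: the function names min_substring_distance, fuzzy_contains and substring_similarity are hypothetical, not part of the code above. It uses the same zero-initialized top row, but keeps only two columns of the matrix, so it does not call levenshtein() and is not affected by the 255-character limit. Lengths are counted in bytes (strlen), so multibyte text would need extra care.

```php
<?php
// Hypothetical helper: smallest edit distance between $needle and ANY
// substring of $haystack, keeping only two matrix columns in memory.
function min_substring_distance($needle, $haystack)
{
    $m = strlen($needle);
    $n = strlen($haystack);

    // Column for the empty haystack prefix: matching the first $i needle
    // characters against nothing costs $i edits.
    $prev = range(0, $m);
    $best = $prev[$m];

    for ($j = 1; $j <= $n; $j++) {
        // The top cell is always 0, so a match may start at any position.
        $curr = array(0);
        for ($i = 1; $i <= $m; $i++) {
            $substitute = $prev[$i - 1] + ($needle[$i - 1] === $haystack[$j - 1] ? 0 : 1);
            $delete     = $curr[$i - 1] + 1;
            $insert     = $prev[$i] + 1;
            $curr[$i]   = min($substitute, $delete, $insert);
        }
        // The bottom cell is the best match ending at haystack position $j.
        $best = min($best, $curr[$m]);
        $prev = $curr;
    }
    return $best;
}

// True if $needle occurs in $haystack with at most $max_errors edits.
function fuzzy_contains($needle, $haystack, $max_errors)
{
    return min_substring_distance($needle, $haystack) <= $max_errors;
}

// Similarity as a percentage of the needle length (100 = exact match).
function substring_similarity($needle, $haystack)
{
    $m = strlen($needle);
    if ($m === 0) {
        return 100.0;
    }
    return 100.0 * (1 - min_substring_distance($needle, $haystack) / $m);
}

$haystack = 'they rode the bigyellow schook bus that afternoon';
var_dump(fuzzy_contains('big yellow school bus', $haystack, 2)); // bool(true)
var_dump(fuzzy_contains('big yellow school bus', $haystack, 1)); // bool(false)
```

For the example from the question the best match needs two edits (one missing space, one wrong letter), so a threshold of 2 errors, or roughly 90% similarity on a 21-character needle, accepts it.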
    