I have 30,000 rows in a database that need to be similarity checked (using

Question

0

Editorial Team

Asked: May 22, 20262026-05-22T19:06:06+00:00 2026-05-22T19:06:06+00:00

I have 30,000 rows in a database that need to be similarity checked (using

0

I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).

In order to do this it will require doing 30,000^2 checks for each columns.

I estimate I will be checking on average 4 columns.

This means I will have to do 3,600,000,000 checks.

What is the best (fastest, and most reliable) way to do this with PHP, bearing in mind request memory limits and time limits etc?

The server need to still actively serve webpages at the same time as doing this.

PS. The server we are using is an 8 core Xeon 32 GB ram.

Edit:

The size of each column is normally less that 50 characters.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-22T19:06:07+00:00

I guess you just need FULL TEXT search.

If that not fits you, you have only one chance to solve this: cache the results.
So you will not have to parse 3bil of records for each requests

Anyway here how you can do it:

   $result = array();       

   $sql = "SELECT * FROM TABLE";
   while( $row = ... ) {
      $result[] = $row;    //> Append the current record     
   }

Now results contains all the rows from your table.

At this point you said you want to similar_text() all columns with each other.
To do that and cache the results you need at least a table (as I said in the comment).

   //> Starting calculating the similarity
   foreach($result as $k=>$v) {
      foreach($result as $k2=>$v2) {

           //> At this point you have 2 rows, $v and $v2 containing your column

           $similarity = 0;

           $similartiy += levensthein($v['column1'],$v2['column1']);
           $similartiy += levensthein($v['column2'],$v2['column2']);               
           //> What ever comparison you need here between columns

           //> Now you can finally store the result by inserting in a table the $similarity
           "INSERT DELAYED INTO similarity (value) VALUES ('$similarity')"; 

      }                      
   }

2 Things you have to notice:

I used levensthein because it’s much faster than similar_text (notice it’s value it’s the contrary of similar_text, because the greater the value levensthein returns the less the affinity between string)
I Used INSERT DELAYED to greatly lower the database cost

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have 30,000 rows in a database that need to be similarity checked (using

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply