I am looking to optimize a fairly simple algorithm that is currently O(n²)

Question

Asked: May 23, 2026 (2026-05-23T17:01:42+00:00)

I am looking to optimize a fairly simple algorithm that is currently
O(n²). I have a file of records, where each one needs to
be compared to every other record in the same file. If two records are the
‘same’ (the comparator function is pretty complicated), the matched
records are output. Note that several records may match
each other, and there is no notion of ordering: a comparison only returns True or False.

Pseudo code:


For (outRec in sourceFile) {
  Get new filePointer for targetFile //starting from the top of the file for inner loop
  For (inRec in targetFile) {
    if (compare(outRec, inRec) == TRUE ) {
      write outRec
      write inRec
    }
    increment some counters
  }
  increment some other counters
}

The data is not sorted in any way, and there is no preprocessing
possible to order the data.

Any ideas on how this could become something less than
O(n²)? I am thinking of applying the MapReduce paradigm
to the code, breaking up the outer AND inner loops, possibly using a
chained Map function. I am pretty sure I have the code figured out on
Hadoop, but wanted to check alternatives before I spend time coding
it.

Suggestions appreciated!

Added: Record types. Basically, I need to match names/strings. The
types of matching are shown in the example below.


1,Joe Smith,Daniel Foster
2,Nate Johnson,Drew Logan
3,Nate Johnson, Jack Crank
4,Joey Smyth,Daniel Jack Foster
5,Joe Morgan Smith,Daniel Foster



Expected output:
Records 1,4,5 form a match set
End of output

Added: these files will be quite large. The largest file is
expected to be around 200 million records.


1 Answer

Editorial Team · Answer 1 · 2026-05-23T17:01:42+00:00


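Assuming the files aren’t too large to index in memory, I’d go through the file once, compute a hash for each row, and keep track of each hash together with its line number (or file pointer position). Then sort the list of hashes and identify those that appear more than once; rows that share a hash are the candidate matches. That takes O(n log n) rather than O(n²), but it only finds rows that hash identically, so the hash has to be computed over whatever normalized form makes ‘same’ records equal.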
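A minimal Python sketch of the hash-and-sort idea above, in case it helps to see it concretely. The `normalize` helper, the `find_duplicate_groups` name, and record 6 in the sample data are my own assumptions for illustration; they are not from the question, and the real comparator is certainly more involved than lowercasing and trimming whitespace.

```python
import hashlib
from itertools import groupby
from operator import itemgetter


def normalize(record: str) -> str:
    """Hypothetical normalization: drop the leading ID, lowercase, collapse whitespace."""
    _, _, fields = record.partition(",")
    return ",".join(" ".join(f.split()).lower() for f in fields.split(","))


def find_duplicate_groups(lines):
    """Return lists of line numbers whose normalized records hash identically."""
    # Pass 1: one hash per row, stored with its line number.
    hashed = [
        (hashlib.sha1(normalize(line).encode("utf-8")).hexdigest(), line_no)
        for line_no, line in enumerate(lines, start=1)
    ]
    # Pass 2: sorting places equal hashes next to each other (O(n log n)).
    hashed.sort()
    groups = []
    for _, members in groupby(hashed, key=itemgetter(0)):
        line_nos = [line_no for _, line_no in members]
        if len(line_nos) > 1:  # keep only hashes that appear more than once
            groups.append(line_nos)
    return groups


records = [
    "1,Joe Smith,Daniel Foster",
    "2,Nate Johnson,Drew Logan",
    "3,Nate Johnson, Jack Crank",
    "4,Joey Smyth,Daniel Jack Foster",
    "5,Joe Morgan Smith,Daniel Foster",
    "6,joe smith, Daniel  Foster",  # hypothetical extra row, not in the question
]
print(find_duplicate_groups(records))  # prints [[1, 6]]
```

Note that with this simple normalization the sketch does not reproduce the expected match set 1,4,5 from the question: records like "Joey Smyth" and "Joe Smith" hash differently. The approach only pays off if `normalize` can map every pair the comparator would call ‘same’ onto one key, which may or may not be feasible for the matching rules described.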
