I am looking to optimize a fairly simple algorithm that is currently
O(n^2). I have a file of records, where each one needs to
be compared to every other record in the same file. If two records are the
‘same’ (the comparator function is pretty complicated), the matched
records are output. Note that several records may match
each other, and there is no sense of order; a comparison only yields True or False.
Pseudocode:
for (outRec in sourceFile) {
    get new filePointer for targetFile  // start from the top of the file for the inner loop
    for (inRec in targetFile) {
        if (compare(outRec, inRec) == TRUE) {
            write outRec
            write inRec
        }
        increment some counters
    }
    increment some other counters
}
The data is not sorted in any way, and there is no preprocessing
possible to order the data.
Any ideas on how this could become something less than
O(n^2)? I am thinking of applying the MapReduce paradigm
on the code, breaking up the outer AND inner loops, possibly using a
chained Map function. I am pretty sure I have the code figured out on
Hadoop, but wanted to check alternatives before I spent time coding
it.
Suggestions appreciated!
Added: Record types. Basically, I need to match names/strings. The
types of matching are shown in the example below.
1,Joe Smith,Daniel Foster
2,Nate Johnson,Drew Logan
3,Nate Johnson, Jack Crank
4,Joey Smyth,Daniel Jack Foster
5,Joe Morgan Smith,Daniel Foster
Expected output:
Records 1,4,5 form a match set
End of output
Added: these files will be quite large. The largest file is
expected to be around 200 million records.
Assuming the files aren’t ridiculously large, I’d go through the file in its entirety, compute a hash for each row, and keep track of hash/line-number (or file-pointer position) pairs. Then sort the list of hashes and identify those that appear more than once.
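A minimal Python sketch of the hash-and-sort idea, under some assumptions that are mine rather than the question's: the first comma-separated field is a record ID that should not take part in matching, and the hypothetical `match_key` helper (lowercase, collapse whitespace) stands in for whatever normalization makes two 'same' records produce identical text. A hash only groups records whose keys are exactly equal, so fuzzy pairs such as "Joe Smith" and "Joey Smyth" would not be found unless the key function maps them to the same value.

```python
import hashlib
from itertools import groupby


def match_key(record: str) -> str:
    """Hypothetical normalization: drop the leading ID field,
    lowercase the rest, and collapse runs of whitespace."""
    _, _, rest = record.partition(",")
    return " ".join(rest.lower().split())


def find_duplicate_groups(lines):
    """Return lists of 1-based line numbers whose keys hash identically."""
    keyed = []
    for line_no, line in enumerate(lines, start=1):
        digest = hashlib.sha1(match_key(line).encode("utf-8")).hexdigest()
        keyed.append((digest, line_no))

    # Sorting puts equal hashes next to each other: O(n log n) overall.
    keyed.sort()

    groups = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        line_nos = [line_no for _, line_no in run]
        if len(line_nos) > 1:  # keep only hashes seen more than once
            groups.append(line_nos)
    return sorted(groups)


# Made-up sample data for illustration only.
sample = [
    "1,Joe Smith,Daniel Foster",
    "2,Nate Johnson,Drew Logan",
    "3,joe  smith,daniel foster",
    "4,Joey Smyth,Daniel Jack Foster",
]
print(find_duplicate_groups(sample))  # prints [[1, 3]]
```

This keeps every (hash, line number) pair in memory, which fits the "not ridiculously large" assumption. For something on the order of 200 million records, the same idea would probably need the pairs written to disk and sorted externally.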