Are you running on OS 3.0? I saw the same…

Question

0

Asked: May 11, 20262026-05-11T21:38:24+00:00 2026-05-11T21:38:24+00:00

I’m looking for the appropriate algorithm to use to compare two files. I think

0

I’m looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.

What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.

I could use diff to compare these files, but I don’t want to because:

diff tries to group changes together, finding which chunks in a file have changed. I’m only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.
Generalized diff algorithms are O(mn) in runtime or space. I’m looking for something more like O(m+n) in time and O(1) in space.

Here are the constraints on the problem:

The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.
Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.
I don’t need to group the results together, like saying “this entire directory was deleted” or “lines 100-200 are new”. I can individually list each line that is different.

I’m thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists. The hitch is the list items aren’t necessarily sorted alphabetically, so you don’t know if one item is “greater” than another. You just know that the files that are present in both lists will be in the same order.

For what it’s worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.

Answer: This problem is called Longest Common Subsequence.

Response: I’m trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more “heuristical”. My intuition tells me that there is a linear-time algorithm due to the added constraints.

Answer: Sort them alphabetically and then compare.

Response: That would be O(m log m+n log n), which is worse than O(m+n).

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-11T21:38:25+00:00

This isn’t quite O(1) memory, the memory requirement in the order of the number of changes, but it’s O(m+n) runtime.

It’s essentially a buffered streaming algorithm that at any given line knows the difference of all previous lines.

// Pseudo-code:
initialize HashMap<Line, SourceFile> changes = new empty HashMap
while (lines left in A and B) {
    read in lineA from file A
    read in lineB from file B

    if (lineA.equals(lineB)) continue

    if (changes.contains(lineA) && changes.get(lineA).SourceFile != A) {
         changes.remove(lineA)
    } else {
         changes.add(lineA, A)
    }

    if (changes.contains(lineB) && changes.get(lineB).SourceFile != B) {
         changes.remove(lineB)
    } else {
         changes.add(lineB, B)
    }
}

for each (line in longerFile) {
    if (changes.contains(line) && changes.get(line).SourceFile != longerFile) {
         changes.remove(line)
    } else {
         changes.add(line, longerFile)
    }
}

Lines in the HashMap from SourceFile == A have been removed
Lines in the HashMap from SourceFile == B have been added

This heavily relies on the fact the the files are listed in the same relative order. Otherwise, the memory requirement would be much larger than the number of changes. However, due to that ordering this algorithm shouldn’t use much more memory than 2 * numChanges.

How to approach applying for a job at a company ...

How to handle personal stress caused by utterly incompetent and ...

What is a programmer’s life like?

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions