I have a set of strings representing the history of a document. Each string is the whole document; no diff analysis has been done yet.
I need a relatively efficient algorithm to allow me to annotate substrings of the document with the version they came from.
For example, if the document history was like this:
Rev1: The quiet fox
Rev2: The quiet brown fox
Rev3: The quick brown fox
The algorithm would give:
The quick brown fox
1111111331222222111
i.e. “The qui” was added in revision 1, “ck” was added in revision 3, “ ” was added in revision 1, “brown ” was added in revision 2, and finally “fox” was added in revision 1.
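To make the target behaviour concrete, here is a minimal sketch in Python. It assumes the standard-library `difflib.SequenceMatcher` as the diff engine, which is only a stand-in and not a recommendation of a specific algorithm. The function name `annotate` is hypothetical. Each revision is diffed character by character against the previous one: unchanged characters keep the label they already had, and inserted or replaced characters are stamped with the current revision number.

```python
from difflib import SequenceMatcher


def annotate(revisions):
    """Return (final_text, labels), where labels[i] is the 1-based number
    of the revision that introduced character i of the final text."""
    if not revisions:
        return "", []

    text = revisions[0]
    labels = [1] * len(text)  # everything in the first revision is "rev 1"

    for rev_no, new_text in enumerate(revisions[1:], start=2):
        # autojunk=False disables the popularity heuristic that would
        # otherwise kick in on long inputs and distort character diffs.
        matcher = SequenceMatcher(None, text, new_text, autojunk=False)
        new_labels = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Surviving characters carry their old label forward.
                new_labels.extend(labels[i1:i2])
            elif tag in ("insert", "replace"):
                # New characters belong to the current revision.
                new_labels.extend([rev_no] * (j2 - j1))
            # "delete": nothing from the old range survives.
        text, labels = new_text, new_labels

    return text, labels


history = ["The quiet fox", "The quiet brown fox", "The quick brown fox"]
text, labels = annotate(history)
print(text)
# One digit per character; only readable for fewer than 10 revisions.
print("".join(str(n) for n in labels))
```

For the three revisions above, this reproduces the annotation row shown in the example. `SequenceMatcher` can take quadratic time in the worst case, so for large documents or long histories a line-based or token-based diff is probably a better starting point than a character-based one.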
I have a class library that can do this easily, though I don’t know how well it performs with large revisions or with many of them.
The library is here: DiffLib on CodePlex (you can also install it through NuGet).
The script for your example in the question is here (you can run this in LINQPad if you add a reference to the DiffLib assembly):
The output: