So I’ve been trying to solve this for some hours now, but apparently there’s still something missing. Maybe I’m thinking about it the wrong way, but I think it is a very complex problem:
I have three lists with items in a fixed order. To explain the problem, assume they contain the items A to Z, mostly in the same order, with some exceptions where items are in different positions. Only one list contains all items; the others contain subsets and are missing certain items. A solution for this case would be sufficient, but it may also happen that no list contains all items and the lists only partly overlap. Even better would be an algorithm that solves the problem for more than three lists.
So here’s the example:
List 1: A B C D E F G H I J
List 2: A C D B F G
List 3: B C D E H F G
Now what I want is to match these three lists to visualize where the sort order differs and which items are missing. So the result should be:
List 1: A B C D E F G H I J
List 2: A C D B F G
List 3: B C D E H F G
So I can immediately see that List 2 has B in the wrong position and that A is missing from List 3, which also has H in the wrong position.
I was thinking about storing the result in a CSV to import into Excel. So the rows are:
A,A,
B,,B
C,C,C
...
Now my question is: how do I match the lists this way to generate the CSV output? The language I use is Java. So far I have failed on the case where a list other than the reference list contains an item earlier than it appears in the reference list.
This is by the way a real-world problem.
Any suggestions are appreciated.
There are off-the-shelf tools for solving this problem, such as the Unix tool diff3. Trying to solve it for an arbitrary number of lists is not advisable unless you are willing to invest a lot of time in developing heuristics, as you are then dealing with the NP-hard general case of the longest common subsequence problem.
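If one list really does contain every item, a simpler route may be enough: compare each of the other lists against that reference list separately, using the ordinary two-sequence longest common subsequence, which is cheap to compute. The following is only a rough sketch of that idea, not a general multi-list solution. The class and method names (ListAligner, inOrderItems, outOfOrder, toCsvRows) are made up for illustration, and it assumes that items are unique within each list and contain no commas. Items that are part of the common subsequence are written in the reference item's row; items that are missing or out of order leave the cell empty, as in the sample rows of the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; all names here are illustrative assumptions.
final class ListAligner {

    // Items of one longest common subsequence of ref and other (classic O(n*m) DP).
    // Assumes items are unique within each list.
    static Set<String> inOrderItems(List<String> ref, List<String> other) {
        int n = ref.size(), m = other.size();
        // len[i][j] = LCS length of ref[i..] and other[j..]
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                len[i][j] = ref.get(i).equals(other.get(j))
                        ? len[i + 1][j + 1] + 1
                        : Math.max(len[i + 1][j], len[i][j + 1]);
            }
        }
        // Walk the table from the start to collect the matched items.
        Set<String> common = new HashSet<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (ref.get(i).equals(other.get(j))) {
                common.add(ref.get(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return common;
    }

    // Items of other that are not aligned with ref: either out of order
    // or not present in ref at all.
    static List<String> outOfOrder(List<String> ref, List<String> other) {
        Set<String> common = inOrderItems(ref, other);
        List<String> moved = new ArrayList<>();
        for (String item : other) {
            if (!common.contains(item)) {
                moved.add(item);
            }
        }
        return moved;
    }

    // One CSV row per reference item; a cell is empty when the item is
    // missing from that list or sits out of order in it. No CSV escaping.
    static List<String> toCsvRows(List<String> ref, List<List<String>> others) {
        List<Set<String>> aligned = new ArrayList<>();
        for (List<String> other : others) {
            aligned.add(inOrderItems(ref, other));
        }
        List<String> rows = new ArrayList<>();
        for (String item : ref) {
            StringBuilder row = new StringBuilder(item);
            for (Set<String> common : aligned) {
                row.append(',');
                if (common.contains(item)) {
                    row.append(item);
                }
            }
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> list1 = Arrays.asList("A", "B", "C", "D", "E", "F", "G", "H", "I", "J");
        List<String> list2 = Arrays.asList("A", "C", "D", "B", "F", "G");
        List<String> list3 = Arrays.asList("B", "C", "D", "E", "H", "F", "G");
        for (String row : toCsvRows(list1, Arrays.asList(list2, list3))) {
            System.out.println(row);
        }
        System.out.println("Out of order in List 2: " + outOfOrder(list1, list2));
        System.out.println("Out of order in List 3: " + outOfOrder(list1, list3));
    }
}
```

For the lists in the question, the first three rows come out as A,A, then B,,B then C,C,C, and outOfOrder reports B for List 2 and H for List 3. An empty cell alone does not tell a missing item from a misplaced one, which is why the misplaced items are reported separately here. When no list contains all items there is no obvious reference to compare against, and this sketch does not cover that case.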