A colleague came to me with a problem that I managed to answer but I don’t know if my answer is right or even good…
He is creating a program to compare data in various files, in this case Excel spreadsheets. He has a list of comparisons, each of which boils down to two files with references to cells in them. For each comparison it is necessary to open the files, do the comparison and then close the files.
Of course this can be optimised if you order the comparisons such that you can keep one file and just change the other.
So how should you sort the files to minimise the number of times you need to close and open files?
It should be noted that the idea of just having all files open is not feasible since there could be over 500 different spreadsheets being compared.
My solution was to find the file that occurs in the most comparisons and process all the comparisons involving it first. Then repeat the process, ignoring all the comparisons that have already been done.
I am wondering whether, when you process that first batch, you want to do the least common partner files first, so that you end on the most common one. That is then the file you process next (meaning still only one file change).
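To make the idea above concrete, here is a minimal Python sketch of it, including the "least common first" variation. The function names `greedy_order` and `partner` are made up for illustration, and the input is the comparison list from the concrete example further down. It is only a heuristic under those assumptions, not a proven optimum.

```python
from collections import Counter

def partner(pair, hub):
    """Return the file in `pair` that is not `hub`."""
    return pair[1] if pair[0] == hub else pair[0]

def greedy_order(pairs):
    """Order comparisons batch by batch around one 'hub' file at a time."""
    remaining = list(pairs)
    order = []
    hub = None
    while remaining:
        # How often each file appears in the comparisons still to do.
        counts = Counter(f for pair in remaining for f in pair)
        if hub is None or counts[hub] == 0:
            hub = max(counts, key=counts.get)  # most common remaining file
        batch = [p for p in remaining if hub in p]
        # Least common partner first, so the batch ends on the busiest one.
        batch.sort(key=lambda p: counts[partner(p, hub)])
        order.extend(batch)
        remaining = [p for p in remaining if hub not in p]
        # Keep the last partner open and try to use it as the next hub.
        hub = partner(batch[-1], hub)
    return order

pairs = [
    ("FileA", "FileB"), ("FileA", "FileC"), ("FileA", "FileD"),
    ("FileA", "FileE"), ("FileA", "FileF"), ("FileB", "FileC"),
    ("FileB", "FileF"), ("FileC", "FileD"), ("FileC", "FileE"),
    ("FileD", "FileF"), ("FileE", "FileF"),
]
order = greedy_order(pairs)
for a, b in order:
    print(a, b)
```

Each batch keeps the hub file open throughout, so the only place a double change can happen is at the boundary between two batches.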
So can anybody either give me a better option or confirm that my idea is good (or good enough)?
Concrete example:
Here is an example list of comparisons with a note next to each showing how many files need to be unloaded and loaded at that step. E.g. after comparing FileA and FileB it only needs to unload FileB and load FileC to do the next comparison. After comparing FileA and FileF it needs to unload both in order to load FileB and FileC.
FileA FileB
FileA FileC One file change
FileA FileD One file change
FileA FileE One file change
FileA FileF One file change
FileB FileC Two file changes
FileB FileF One file change
FileC FileD Two file changes
FileC FileE One file change
FileD FileF Two file changes
FileE FileF One file change
In this example the comparisons can be rearranged so that at each step you only need to unload one file and load one other.
FileA FileB
FileA FileD One file change
FileA FileE One file change
FileA FileF One file change
FileA FileC One file change
FileB FileC One file change
FileC FileD One file change
FileC FileE One file change
FileE FileF One file change
FileB FileF One file change
FileD FileF One file change
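The notes in the two lists above can be checked mechanically. This is a small Python sketch with a hypothetical helper, `file_changes`, that counts how many files have to be loaded between consecutive comparisons. It assumes the initial load of the first pair is not counted, as in the lists.

```python
def file_changes(order):
    """Count files that must be newly loaded between consecutive comparisons."""
    total = 0
    for prev, cur in zip(order, order[1:]):
        # Files in the current pair that were not open for the previous pair.
        total += len(set(cur) - set(prev))
    return total

# The two orderings from the example, written with single-letter file names.
original = ["AB", "AC", "AD", "AE", "AF", "BC", "BF", "CD", "CE", "DF", "EF"]
reordered = ["AB", "AD", "AE", "AF", "AC", "BC", "CD", "CE", "EF", "BF", "DF"]

print(file_changes(original))   # 13: seven single changes and three double changes
print(file_changes(reordered))  # 10: one change at every step
```

Any candidate sorting algorithm can be scored the same way, which makes it easy to compare heuristics on real comparison lists.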
So what I want to know is what the best algorithm is to sort the file pairs to get the minimum number of total file unload/load operations.
I should note that it is not always going to be possible to get it down to one file change each time, as demonstrated by the trivial pair of comparisons below:
FileA FileB
FileC FileD Two file changes
Here’s an idea:
Consider a graph where each file is a node, and each required comparison is an edge.
Now, if you find an Eulerian path in the graph (a path that uses every edge exactly once), it gives a sequence in which only one file is replaced after each comparison, because consecutive edges on the path share a node.
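If no Eulerian path exists, then once you cannot proceed with the path, just jump to some node with an odd number of unused edges (and if all of them have an even number, just pick any node that still has unused edges). This approach will probably still give you good results, but at some point(s) in the sequence you may have to replace two files instead of one. Note that an Eulerian path is sufficient but not necessary for a sequence that replaces only one file at each step: consecutive comparisons only need to share a file. In the example from the question, FileA, FileB, FileD and FileE each appear in an odd number of comparisons, so no Eulerian path exists, yet the reordered list still changes only one file every time.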
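Here is a rough Python sketch of the walk-and-jump idea. The names `trail_order` and `pick_start` are invented for illustration. It assumes no file is compared with itself, and it follows edges greedily rather than using a full Eulerian path algorithm such as Hierholzer's, so it may jump even when an Eulerian path exists.

```python
from collections import defaultdict

def trail_order(pairs):
    """Walk the comparison graph edge by edge, jumping when stuck."""
    adj = defaultdict(set)  # file -> files it still has to be compared with
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    remaining = sum(len(v) for v in adj.values()) // 2

    def pick_start():
        # Prefer a node with an odd number of unused edges, else any live node.
        live = [n for n in adj if adj[n]]
        odd = [n for n in live if len(adj[n]) % 2 == 1]
        return (odd or live)[0]

    order = []
    node = pick_start()
    while remaining:
        if not adj[node]:
            node = pick_start()  # stuck: jump to another node
        nxt = min(adj[node])     # deterministic choice of the next edge
        adj[node].remove(nxt)
        adj[nxt].remove(node)
        order.append((node, nxt))
        node = nxt               # keep this file open for the next comparison
        remaining -= 1
    return order

pairs = [
    ("FileA", "FileB"), ("FileA", "FileC"), ("FileA", "FileD"),
    ("FileA", "FileE"), ("FileA", "FileF"), ("FileB", "FileC"),
    ("FileB", "FileF"), ("FileC", "FileD"), ("FileC", "FileE"),
    ("FileD", "FileF"), ("FileE", "FileF"),
]
order = trail_order(pairs)
for a, b in order:
    print(a, b)
```

A jump does not always cost two file changes: if the node jumped to happens to be one of the two files already open, only one file is replaced.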