I am trying to solve the following problem:
Suppose I have a list of software packages whose names might look like this (the only thing known is that each name has the form SOMETHING + VERSION, meaning that the version always comes after the name):
Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED
Efficient.Exclusive.Zip.Archiver.123.01
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip.Archiver14.06
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Custom.Zip.Archiver1
Now I need to parse this list and select only the latest version of each package. For this example, the expected result would be:
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
My current approach can be described as follows:
Split the initial strings into groups by their starting letter,
ignoring spaces, case and special symbols.
(`E`, `Z`, `C` for the example list above)
Foreach element {
Apply the regular expression (or a set of regular expressions),
which tries to deduce the version from the string and perform
the following conversion `STRING -> (VERSION, STRING_BEFORE_VERSION)`
// Example for this step:
// 'Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED' ->
// (122.24, Efficient.Exclusive.Zip.Archiver-PROPER)
Search through the corresponding group (in this example, the 'E' group)
and find every other string that starts with 'STRING_BEFORE_VERSION' or
with its significant part. This comparison ignores case and
special symbols.
// The matches for this step:
// Efficient.Exclusive.Zip.Archiver-PROPER, {122.24}
// Efficient.Exclusive.Zip.Archiver, {123.01}
// Efficient-Exclusive.Zip.Archiver, {126.24, 2011}
// The last one will get picked, because the year is ignored.
Get the possible version from each match, ***pick the latest, yield that match.***
Remove every possible match (including the initial element) from the list.
}
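The `STRING -> (VERSION, STRING_BEFORE_VERSION)` step above could be sketched in Python roughly as follows. This is only an illustration, not the regular expressions actually in use: the function name `split_name_version` and both patterns are hypothetical, and the sketch assumes that years always appear in parentheses, that the version is the first dotted number left after removing them, and that versions compare component by component (so `1.08` is treated as `(1, 8)`).

```python
import re

# Assumption: a year is written as "(2011)" and is never the version.
YEAR = re.compile(r"\(\d{4}\)")
# Optional separators, an optional "v" marker, then a dotted number.
VERSION = re.compile(r"[\W_]*(?:v[\W_]*)?(\d+(?:\.\d+)*)", re.IGNORECASE)


def split_name_version(s):
    """Return (name_before_version, version_tuple) for one package string."""
    cleaned = YEAR.sub("", s)
    m = VERSION.search(cleaned)
    if m is None:
        # No number found: treat the whole string as the name.
        return cleaned, ()
    version = tuple(int(part) for part in m.group(1).split("."))
    return cleaned[:m.start()], version


print(split_name_version("Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED"))
print(split_name_version("Efficient-Exclusive.Zip.Archiver(2011)-126.24-X"))
print(split_name_version("Custom.Zip.Archiver1"))
```

Tuples compare element by element in Python, so `(1, 8) > (1,)` holds and `Custom.Zip.Archiver1.08` ranks above `Custom.Zip.Archiver1`. A name that itself ends in a digit or in a bare `v` would be split incorrectly by these patterns.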
This algorithm should (I assume) run in roughly O(N * V + N lg N * M), where M is the average string-matching time and V is the running time of the version regexp.
However, I suspect there is a better solution (there always is!), perhaps a specific data structure or a better matching approach.
If you can suggest something or have comments on the current approach, please do not hesitate to share them.
How about this? (Pseudo-Code)
Dictionary operations are O(1) on average, so the total runtime is O(n). No pre-grouping is necessary, and instead of storing all matches you store only the one that is currently the newest.
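A minimal Python sketch of this single-pass idea, under stated assumptions: the input has already been parsed into `(original, name, version)` triples, and the hypothetical `normalize` helper (lowercase, letters and digits only) decides which names count as the same package. Neither name comes from the question or from any library.

```python
import re


def normalize(name):
    """Hypothetical grouping key: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def latest_per_package(parsed):
    """parsed: iterable of (original, name, version_tuple) triples.

    Keeps one dictionary entry per package and overwrites it only
    when a newer version shows up, so each item is visited once.
    """
    newest = {}
    for original, name, version in parsed:
        key = normalize(name)
        current = newest.get(key)
        if current is None or version > current[0]:
            newest[key] = (version, original)
    return [original for _, original in newest.values()]


parsed = [
    ("Zip.Archiver14.06", "Zip.Archiver", (14, 6)),
    ("Zip-Archiver.v15.08-T", "Zip-Archiver", (15, 8)),
    ("Custom.Zip.Archiver1.08", "Custom.Zip.Archiver", (1, 8)),
    ("Custom.Zip.Archiver1", "Custom.Zip.Archiver", (1,)),
]
print(latest_per_package(parsed))
# ['Zip-Archiver.v15.08-T', 'Custom.Zip.Archiver1.08']
```

Note that an exact-key lookup like this does not reproduce the "starts with its significant part" matching from the question: a name such as `Efficient.Exclusive.Zip.Archiver-PROPER` would land in its own group unless the normalization also strips tags like `PROPER`.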
Dictionary has a constructor that accepts an IEqualityComparer object. There you can implement your own semantics of equality between package names. Keep in mind, however, that this IEqualityComparer must also implement a GetHashCode method that returns the same value for all objects you consider equal. To reproduce the example above, you could return a hash code based on the first character of the string, which would reproduce your first-letter grouping inside the dictionary. However, you will get better performance from a smarter hash code that produces fewer collisions, for example one that uses more characters, as long as equal names still hash to the same value.
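Python's `dict` has no comparer argument, so the rough equivalent is a key object that defines `__eq__` and `__hash__` together. The sketch below illustrates the contract described above; the classes `PackageKey` and `FirstLetterKey` are hypothetical, and "equal after dropping case and special symbols" is only one possible definition of equality.

```python
import re


class PackageKey:
    """Hypothetical dict key: names are equal if they match after normalization."""

    def __init__(self, name):
        self.name = name
        # Lowercase and keep only letters and digits.
        self.norm = re.sub(r"[^a-z0-9]", "", name.lower())

    def __eq__(self, other):
        return isinstance(other, PackageKey) and self.norm == other.norm

    def __hash__(self):
        # Equal keys must hash equally, so hash the same normalized form.
        return hash(self.norm)


class FirstLetterKey(PackageKey):
    """Same equality, but a coarse hash built from the first character only."""

    def __hash__(self):
        return hash(self.norm[:1])


table = {PackageKey("Zip.Archiver"): (14, 6)}
print(PackageKey("zip-archiver") in table)       # True

coarse = {FirstLetterKey("Zip.Archiver"): (14, 6)}
print(FirstLetterKey("zip-archiver") in coarse)  # True
print(FirstLetterKey("Zebra.Tool") in coarse)    # False
```

Both variants give correct lookups. With the first-letter hash, every name starting with the same letter shares a bucket and has to be told apart by `__eq__`, which is the collision cost mentioned above.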