I have a large set of strings. I want to divide the strings into subsets such that:
- All items in a subset share a common substring of one or more contiguous characters.
- The shared substring that defines a subset is unique among the subsets (i.e. the shared characters are sufficient to define a subset of strings that is mutually exclusive with the other subsets).
- The subsets are roughly the same size.
- The number of subsets is the minimum needed to satisfy the above criteria.
For example, given the following set of names:
Alan,Larry,Alfred,Barbara,Alphonse,Carl
I can divide this set into two subsets of equal size. Subset 1, defined by the contiguous characters “Al”, would be
Alan, Alfred, Alphonse
Subset 2, defined by the contiguous characters “ar”, would be
Larry, Barbara, Carl.
I am looking for an algorithm that would do this for any arbitrary set of strings. The number of resulting subsets does not have to be 2, but it should be the minimum possible, and the subsets should be approximately equal in size.
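To make the criteria concrete, here is a minimal Python sketch that only checks a candidate answer; it does not search for one. The function name `partition_by_substrings` is hypothetical, and matching is assumed to be case-insensitive. Each string must contain exactly one of the candidate defining substrings, otherwise the subsets would not be mutually exclusive.

```python
def partition_by_substrings(strings, keys):
    """Assign each string to the single key it contains.

    Raises ValueError if a string contains none of the keys or more
    than one, since the subsets would then not be mutually exclusive.
    Matching is case-insensitive (an assumption, not part of the question).
    """
    groups = {key: [] for key in keys}
    for s in strings:
        matches = [k for k in keys if k.lower() in s.lower()]
        if len(matches) != 1:
            raise ValueError(f"{s!r} matches {len(matches)} keys: {matches}")
        groups[matches[0]].append(s)
    return groups


names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
groups = partition_by_substrings(names, ["al", "ar"])
print(groups)
# {'al': ['Alan', 'Alfred', 'Alphonse'], 'ar': ['Larry', 'Barbara', 'Carl']}
```

The sizes of the resulting lists can then be compared to judge how balanced a candidate set of substrings is.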
Elliott
Have a look at http://en.wikipedia.org/wiki/Suffix_array. It is possible that what you really want to do is to create a suffix array for each document and then merge all the suffix arrays, with pointers back to the original documents, so that you can search the whole collection for a string by looking it up as the prefix of a suffix in the merged array.
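As a rough sketch of that idea in Python: the code below builds one sorted list of suffixes over all strings, each entry carrying a pointer back to its source string, and then uses binary search to find every string containing a given substring. The names `build_suffix_index` and `find_containing` are hypothetical, matching is assumed to be case-insensitive, and suffixes are stored as full strings for simplicity. A real suffix array would store offsets only and use a faster construction.

```python
import bisect


def build_suffix_index(strings):
    """Return a sorted list of (suffix, string_index, offset) over all strings."""
    entries = []
    for i, s in enumerate(strings):
        lowered = s.lower()
        for off in range(len(lowered)):
            entries.append((lowered[off:], i, off))
    entries.sort()
    return entries


def find_containing(entries, strings, pattern):
    """Return the strings that contain pattern, in their original order."""
    pattern = pattern.lower()
    # (pattern,) sorts just before any entry whose suffix starts with pattern.
    pos = bisect.bisect_left(entries, (pattern,))
    hits = set()
    # Suffixes sharing a prefix are contiguous in sorted order.
    while pos < len(entries) and entries[pos][0].startswith(pattern):
        hits.add(entries[pos][1])
        pos += 1
    return [strings[i] for i in sorted(hits)]


names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
index = build_suffix_index(names)
print(find_containing(index, names, "ar"))  # ['Larry', 'Barbara', 'Carl']
print(find_containing(index, names, "al"))  # ['Alan', 'Alfred', 'Alphonse']
```

This only answers "which strings contain this substring"; choosing which substrings give a minimal, balanced partition is a separate step that would sit on top of such an index.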