To make my question clearer, I will start by describing the real-life case I am facing.
I am building a physical panel with many words on it that can be selectively lit, in order to compose sentences. This is my situation:
- I know all the sentences that I want to display
- I want to find (one of) the shortest ORDERED sequences of words that allows me to display all the sentences
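Put differently, every sentence has to be a subsequence of the panel. A minimal sketch of that validity check, using the two sentences from the example that follows (the helper names `is_subsequence` and `covers_all` are hypothetical, chosen only for illustration):

```python
def is_subsequence(sentence, panel):
    """Return True if the words of `sentence` appear in `panel` in order."""
    remaining = iter(panel)
    # `word in remaining` consumes the iterator up to the first match,
    # so each later word is searched only to the right of the previous one.
    return all(word in remaining for word in sentence)


def covers_all(panel, sentences):
    """Return True if every sentence can be lit on the given panel."""
    panel_words = panel.split()
    return all(is_subsequence(s.split(), panel_words) for s in sentences)


sentences = ["A dog is on the table", "A cat is on the table"]
print(covers_all("A dog cat is on the table", sentences))  # True
print(covers_all("A is dog cat on the table", sentences))  # False
```

Any candidate panel can be tested this way before worrying about how short it is.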
Example:
SENTENCES:
"A dog is on the table"
"A cat is on the table"
SOLUTIONS:
"A dog cat is on the table"
"A cat dog is on the table"
I tried to approach this problem with “positional rules”: for each UNIQUE word in the set of ALL the words used in ALL the sentences, I worked out which words should be to its left and which to its right. In the example above, the ruleset for the word “on” would be “left(A, dog, cat, is) + right(the, table)”.
This approach worked for trivial cases, but my real-life situation has two additional difficulties that got me stuck, both of which have to do with the need to repeat words:
- In-sentence repetitions: “the cat is on the table” has two “the”.
- Circular references: In a set of three sentences “A red cat” + “My cat is on the table” + “That table is red”, the rules would state that RED should be at the left of CAT, CAT should be at the left of TABLE and TABLE should be at the left of RED.
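Both difficulties show up as cycles in the "must be to the left of" relation. A minimal sketch, assuming Python 3.9+ for the standard `graphlib` module and treating words case-insensitively (the function names are hypothetical):

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def must_precede(sentences):
    """Map each word to the set of words that must appear to its left."""
    predecessors = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            predecessors.setdefault(word, set()).update(words[:i])
    return predecessors


def order_without_repeats(sentences):
    """Return one valid word order, or None if some word must be repeated."""
    try:
        return list(TopologicalSorter(must_precede(sentences)).static_order())
    except CycleError:
        # The positional rules contradict each other: no panel that shows
        # every word exactly once can display all the sentences.
        return None


print(order_without_repeats(["A dog is on the table", "A cat is on the table"]))
print(order_without_repeats(["A red cat", "My cat is on the table", "That table is red"]))  # None
```

When the result is `None`, a plain ordering of unique words cannot work and at least one word has to appear more than once on the panel.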
MY QUESTION THEREFORE IS:
What is the class of algorithms (or even better: what is the specific algorithm) that studies and solves this kind of problem? Could you post some reference or a code example of it?
EDIT: Level of complexity
From the first round of answers it appears that the actual level of complexity (i.e. how different the sentences are from one another) is an important factor. So, here is some info on that:
- I have about 1500 sentences I want to represent.
- All of the sentences are essentially modifications of a restricted pool of ~10 sentences where only a few words change. Building on the previous example, it is a bit as if all my sentences were about either “somebody’s pet’s position relative to a piece of furniture” or “a physical description of somebody’s furniture”.
- The number of unique words used to build all the sentences is <100.
- Sentences are 8 words long at most.
For this project I am using Python, but any reasonably readable language (e.g. NOT obfuscated Perl!) will be fine.
Thank you in advance for your time!
After a week or so of coding (this is a hobby project) I decided to answer my own question. This is not because the answers I previously got weren’t good enough, but rather because I used all three of them to code the solution I wanted, and it felt wrong to give credit to only one of the responders, as I truly used input from all three to come up with a satisfactory solution.
Let’s start from the end: the heuristic I came up with gives very satisfactory results (at least for the real-life case I am using it for). I had 1440 sentences with an average of ~6 words each, using a set of 70 unique words. The program takes about 1 minute to run and provides me with a supersequence of just 76 words (10% more than the “physical” [but not “logical”] lower limit of the problem).
The heuristic is really tailored around my real-life case, in particular around the fact that most of the sentences are constructed around 10 or so “prototypes” (see point #2 of my edit in the question). It is composed of four successive steps:
Isomorphic shrinking
I defined two sentences A and B as “isomorphic” if transforming A into B and B into A requires exactly the same steps. For example, the transformation might always require replacing the words in positions 1, 3 and 5 of the first string with the words in positions 1, 3 and 5 of the second.
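A minimal sketch of the idea, assuming the members of a family are already known and have the same number of words. The sample sentences and the names `differing_positions` and `merge_family` are made up for illustration, and positions are counted from zero:

```python
def differing_positions(a, b):
    """Zero-based positions where two equal-length word lists differ."""
    if len(a) != len(b):
        raise ValueError("sentences must have the same number of words")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def merge_family(family):
    """Collapse equal-length sentences into one sequence of words.

    At each position, every distinct word seen there is emitted once, in
    first-seen order, so each sentence stays a subsequence of the result.
    """
    rows = [sentence.split() for sentence in family]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all sentences in a family must have the same length")
    merged = []
    for column in zip(*rows):
        merged.extend(dict.fromkeys(column))  # de-duplicate, keep order
    return merged


family = [
    "A dog is on the table",
    "A cat is under the chair",
    "A dog is under the table",
]
print(differing_positions(family[0].split(), family[1].split()))  # [1, 3, 5]
print(" ".join(merge_family(family)))
# A dog cat is on under the table chair
```

How the families are discovered in the first place is left out here; this only shows the merge once a family is given.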
Grouping sentences into “isomorphic families” makes it easy to create optimised superstrings by simply inserting, in the common root "A __ is __ the __", the list of variable elements for positions 1, 3 and 5.
Similarity shrinking
At this stage of the process the number of sentences has dropped dramatically (there is normally a mixture of about 50 supersequences from isomorphic families and orphan sentences that were not isomorphic to any other in the pool). The program analyses this new pool and merges together the two most similar strings. It then repeats the procedure on the new set, composed of the result of the merge and the strings that haven’t been merged yet, and so on until everything has been merged into one supersequence.
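A minimal sketch of such a greedy merge. It assumes that "most similar" means "longest common subsequence" and that two strings are merged into their shortest common supersequence; both are assumptions of this sketch, and the function names are hypothetical:

```python
from itertools import combinations


def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence of a[i:], b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def merge_pair(a, b):
    """Shortest common supersequence of two word lists."""
    table = lcs_table(a, b)
    i = j = 0
    merged = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:  # shared word: emit it once
            merged.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            merged.append(a[i])  # skipping a[i] keeps the best overlap
            i += 1
        else:
            merged.append(b[j])
            j += 1
    return merged + a[i:] + b[j:]


def merge_all(sequences):
    """Greedily merge the two most similar sequences until one remains."""
    pool = [list(s) for s in sequences]
    while len(pool) > 1:
        # Similarity is the LCS length of the pair (assumption of this sketch).
        x, y = max(
            combinations(range(len(pool)), 2),
            key=lambda pair: lcs_table(pool[pair[0]], pool[pair[1]])[0][0],
        )
        merged = merge_pair(pool[x], pool[y])
        pool = [s for k, s in enumerate(pool) if k not in (x, y)] + [merged]
    return pool[0]


print(" ".join(merge_pair("A dog is on the table".split(),
                          "A cat is on the table".split())))
# A dog cat is on the table
```

Each pairwise merge is optimal, but the overall result depends on the merge order and is generally not the shortest possible supersequence, which is why further optimisation passes are still useful afterwards.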
Coarse redundancy optimisation
At this stage we already have a valid supersequence of all the original 1440 sentences, but this supersequence is grossly suboptimal, as many terms are repeated needlessly. This optimisation step removes the bulk of the redundant terms by simply using the supersequence to formulate all the sentences, and then removing from it all the terms that haven’t been used.
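A minimal sketch of this pruning pass. It assumes each sentence is formulated by always picking the leftmost usable occurrence of each word, which is an assumption of the sketch; `prune_unused` is a hypothetical name:

```python
def prune_unused(supersequence, sentences):
    """Drop every position that no sentence uses under leftmost matching."""
    used = set()
    for sentence in sentences:
        position = 0
        for word in sentence:
            # Leftmost occurrence of `word` at or after `position`;
            # raises ValueError if the input is not a valid supersequence.
            position = supersequence.index(word, position)
            used.add(position)
            position += 1
    return [word for k, word in enumerate(supersequence) if k in used]


panel = "A dog cat dog is on the table".split()
sentences = [s.split() for s in ["A dog is on the table", "A cat is on the table"]]
print(" ".join(prune_unused(panel, sentences)))
# A dog cat is on the table
```

Because the kept positions stay in their original order, every sentence that could be formulated before pruning can still be formulated afterwards.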
Fine redundancy optimisation
The result of the previous optimisation is pretty good already, but sometimes it is possible to trim out a word or two via this last step. The optimisation here works by finding words that are repeated more than once, and checking whether it is possible to make two successive repetitions converge towards the same location in the supersequence while still being able to formulate all the sentences. Example: given a supersequence in which the word “xxx” occurs twice, the program will try to shift the two “xxx” words towards each other. If the supersequence reaches a state where the two occurrences are adjacent and no sentence uses both occurrences of “xxx” at the same time, then one of the two “xxx” can go.
Of course this last step could be optimised by shifting “xxx” by more than one position at a time (think of shell sorting vs. bubble sorting), but the general structure of the program is such, and the gain in speed so little, that I preferred this “less optimised” procedure, as this “shifting procedure” is used elsewhere too.
Again, many thanks to all the original responders: your input was paramount in making me think of this solution. Your comments on this solution are also very welcome!
FINAL NOTE: As soon as the program is complete / the project is finished (a couple of months at worst), I will release it under GPL and add the link to the code repo in the original question. I believe that marking the question as “favourite” should notify the marker of any edit... just in case you are interested in the actual code, of course!