I have a problem that I am trying to solve.
The program is given a text file containing the following characters: a–z, A–Z, 0–9, full stop (.) and space. Words in the text file are made up purely of a–z, A–Z and 0–9. The program receives several queries. Each query is a set of full words already present in the file. The program should return the smallest phrase from the file in which all the query words are present (in any order). If there are several such phrases, return the first one.
Here is an example. Let us say that the file contains:
Bar is doing a computer science degree. Bar has a computer at home. Bar is now at home.
Query 1 :
Bar computer a
Response:
Bar has a computer
Query 2:
Bar home
Response:
home. Bar
I thought of this solution. For query 1, Bar is searched for first, and all three occurrences of Bar are assembled into a list. Each node in the list also contains the starting position of the smallest phrase and its total length. So it’ll look like
1st node “Bar, 0, 1” [Query, starting posn, total length].
Similarly for 2nd and 3rd node.
Next, computer is searched for. The minimum distance to computer is calculated for each occurrence of Bar.
1st node “Bar Computer”, 0, 5
2nd node “Bar Computer”, 7 , 4 and so on for other nodes
Next, “a” is searched for. The search has to start from the starting position stored in each node and go both left and right until the word is found, since order is unimportant. The occurrence that gives the shortest phrase has to be chosen.
Is this solution on the right track? I feel that doing it this way, I have to be wary of many cases, and there might be a simpler solution available.
If the words are unique, does it become a variant of TSP?
TSP isn’t a great way to think about this problem. Let n be the length of the text and m be the length of the query, both counted in words; assume n > m. The naive solution, which tests every substring of the text against every query word, is already polynomial-time at O(n³ m) for bounded-length words. Now let’s optimize.
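The naive solution can be sketched as follows. This is a minimal illustration, not necessarily the original code: the function name shortest_window_naive is hypothetical, the text is assumed to be already split into a list of words with full stops removed, and the result is a half-open index range (i, j) rather than the phrase itself.

```python
def shortest_window_naive(words, query):
    """Return (i, j) such that words[i:j] is the first shortest run of
    words containing every query word, or None if there is none."""
    best = None
    n = len(words)
    for i in range(n):                      # O(n) start positions
        for j in range(i + 1, n + 1):       # O(n) end positions
            subtext = words[i:j]            # a list, so `in` below is a linear scan
            # m membership tests, each O(n): O(n m) per substring
            if all(q in subtext for q in query):
                # strict < keeps the earliest window among equally short ones
                if best is None or j - i < best[1] - best[0]:
                    best = (i, j)
    return best


text = ("Bar is doing a computer science degree. "
        "Bar has a computer at home. Bar is now at home.")
words = text.replace(".", " ").split()      # assumed tokenization
i, j = shortest_window_naive(words, ["Bar", "computer", "a"])
print(" ".join(words[i:j]))
```

Two nested loops over positions times an O(n m) check per substring gives the O(n³ m) bound.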
First, hoist the inner loop via a hash set.
The running time is now O(n³), or O(n³ log n) if we use a balanced binary tree instead.
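A sketch of the hash-set step, under the same assumptions as before (a pre-split list of words, a hypothetical function name shortest_window_set, and an index range as the result). The only change from the naive version is that each substring is turned into a set once, so each query word is checked in expected constant time.

```python
def shortest_window_set(words, query):
    """Same contract as the naive version: (i, j) or None."""
    best = None
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            subtext_set = set(words[i:j])   # O(n) to build, once per substring
            # m lookups, each expected O(1)
            if all(q in subtext_set for q in query):
                if best is None or j - i < best[1] - best[0]:
                    best = (i, j)
    return best


text = ("Bar is doing a computer science degree. "
        "Bar has a computer at home. Bar is now at home.")
words = text.replace(".", " ").split()      # assumed tokenization
i, j = shortest_window_set(words, ["Bar", "home"])
print(" ".join(words[i:j]))
```

Each substring now costs O(n + m), which is O(n) because n > m, giving O(n³) overall.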
Observe now that it’s wasteful to recompute subtext_set when the upper bound increases by one.
We’re at O(n² m). Now it seems wasteful to recheck the entire query when subtext_set is augmented by just one element: why don’t we just check that one, and remember how many we have to go?
We’re at O(n²). Getting to O(n) requires a couple of insights. First, let’s look at how many query words each substring contains for the example.
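The counts can be recomputed with a short sketch that reuses the incremental idea above: for each lower bound i, grow the substring one word at a time and update num_found only for the newly added word. The function name count_matrix is hypothetical, and query 1 (Bar computer a) is assumed as the example query. Entry counts[i][j] is the number of distinct query words in words[i:j], with i indexing rows and j indexing columns.

```python
def count_matrix(words, query):
    """counts[i][j] = number of distinct query words in words[i:j]
    (0 when j <= i). Runs in O(n^2) time."""
    n = len(words)
    query_set = set(query)
    counts = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        subtext_set = set()     # words seen in words[i:j] so far
        num_found = 0           # distinct query words among them
        for j in range(i, n):
            w = words[j]
            # only the newly added word can change the count
            if w in query_set and w not in subtext_set:
                num_found += 1
            subtext_set.add(w)
            counts[i][j + 1] = num_found
    return counts


text = ("Bar is doing a computer science degree. "
        "Bar has a computer at home. Bar is now at home.")
words = text.replace(".", " ").split()      # assumed tokenization
counts = count_matrix(words, ["Bar", "computer", "a"])
for row in counts:
    print(" ".join(str(c) for c in row))
```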
This matrix has non-increasing columns and non-decreasing rows, and that’s true in general. We want to traverse the underside of the entries with value m, because further in corresponds to a longer solution. The algorithm is the following. If the current i, j have all of the query words, then increase i; otherwise, increase j.
With our current data structures, increasing j is fine but increasing i is not, because our data structures don’t support deletion. Instead of a set, we need to keep a multi-set and decrement num_found when the last copy of a query word disappears.
We’ve arrived at O(n). The last asymptotically relevant optimization is to reduce the extra space usage from O(n) to O(m) by storing counts only for elements in the query. I’ll leave that one as an exercise. (Also, some more care must be taken to handle empty queries.)
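For reference, here is one possible sketch of the linear-time traversal, not necessarily the original code. The function name shortest_window is hypothetical, the text is assumed to be pre-split into words, and the multi-set is a collections.Counter. It counts only query words, which is one way to approach the O(m)-space exercise, and it treats an empty query as having no answer, which is an assumption.

```python
from collections import Counter


def shortest_window(words, query):
    """Return (i, j) such that words[i:j] is the first shortest run of
    words containing every query word, or None if there is none."""
    need = set(query)
    if not need:
        return None             # assumption: an empty query has no answer
    m = len(need)
    counts = Counter()          # multi-set restricted to query words
    num_found = 0               # distinct query words in the current window
    best = None
    i = 0
    for j, w in enumerate(words):           # window is words[i:j + 1]
        if w in need:
            counts[w] += 1
            if counts[w] == 1:
                num_found += 1
        # while the window has every query word, record it and increase i
        while num_found == m:
            if best is None or j + 1 - i < best[1] - best[0]:
                best = (i, j + 1)
            left = words[i]
            if left in need:
                counts[left] -= 1
                if counts[left] == 0:       # last copy left the window
                    num_found -= 1
            i += 1
    return best


text = ("Bar is doing a computer science degree. "
        "Bar has a computer at home. Bar is now at home.")
words = text.replace(".", " ").split()      # assumed tokenization
i, j = shortest_window(words, ["Bar", "computer", "a"])
print(" ".join(words[i:j]))                 # Bar has a computer
```

Both i and j only ever move forward, so the loop body runs O(n) times in total. Because full stops are dropped during tokenization, query 2 yields the words home Bar rather than the literal "home. Bar"; mapping word indices back to character offsets in the file would recover the original punctuation.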