Take the following string as an example:
‘The quick brown fox’
Right now the q in quick is at index 4 of the string (starting at 0) and the f in fox is at index 16. Now let’s say the user enters some more text into this string.
‘The very quick dark brown fox’
Now the q is at index 9 and the f is at index 26.
What is the most efficient method of keeping track of the index of the original q in quick and f in fox no matter how many characters are added by the user?
Language doesn’t matter to me. This is more of a theory question than anything, so use whatever language you want; just try to keep it to generally popular and current languages.
The sample string I gave is short, but I’m hoping for a way that can efficiently handle a string of any size. Updating an array of offsets would work with a short string but will bog down with too many characters.
Even though in the example I was looking for the index of characters that are unique in the string, I also want to be able to track the index of the same character in different locations, such as the o in brown and the o in fox. So searching is out of the question.
I was hoping for the answer to be both time and memory efficient but if I had to choose just one I care more about performance speed.
Let’s say that you have a string and some of its letters are interesting. To make things easier let’s say that the letter at index 0 is always interesting and you never add something before it—a sentinel. Write down pairs of (interesting letter, distance to the previous interesting letter). If the string is ‘+the very Quick dark brown Fox’ and you are interested in q from ‘quick’ and f from ‘fox’ then you would write: (+,0), (q,10), (f,17). (The sign + is the sentinel.)
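As a rough illustration of the bookkeeping described above, here is a minimal Python sketch that turns a list of interesting indices into (letter, distance) pairs. The function name distance_pairs and its parameters are made up for this example; it assumes index 0 holds the sentinel.

```python
def distance_pairs(text, interesting):
    """Return (letter, distance to the previous interesting letter) pairs.

    `interesting` holds indices into `text`. Index 0 (the sentinel) is
    always included, as assumed in the description above.
    """
    positions = sorted(set(interesting) | {0})
    pairs = []
    previous = 0
    for pos in positions:
        pairs.append((text[pos], pos - previous))
        previous = pos
    return pairs


text = "+the very Quick dark brown Fox"
print(distance_pairs(text, [10, 27]))  # [('+', 0), ('Q', 10), ('F', 17)]
```

Adding up the distances from the sentinel to a letter gives back that letter's index, which is the property the rest of the approach relies on.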
Now you put these in a balanced binary tree whose in-order traversal gives the sequence of letters in the order they appear in the string. You might now recognize the partial sums problem: you enhance the tree so that nodes contain (letter, distance, sum). The sum is the sum of all distances in the left subtree. (Therefore sum(x) is 0 when x has no left child; otherwise sum(x) = sum(left(x)) + distance(left(x)) + the distances stored in the right subtree of left(x).)
You can now query and update this data structure in logarithmic time.
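To say that you added n characters to the left of character c (that is, between c and the previous interesting letter) you say distance(c)+=n and then walk up from c, adding n to sum(a) for every ancestor a that has c in its left subtree.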
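To ask what the index of c is, you start with sum(c)+distance(c) and walk up to the root; every time you step up from a right child to its parent p, you add sum(p)+distance(p), because p and everything in p's left subtree come before c in the string.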
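To make the update and query concrete, here is a minimal Python sketch of such a tree. It is an illustration under assumptions, not a definitive implementation: the names Node, build, add_before and index_of are made up, and the set of tracked letters is fixed up front so the tree can be built perfectly balanced once. Tracking new letters later would need a self-balancing tree that maintains sum during rotations.

```python
class Node:
    def __init__(self, letter, distance):
        self.letter = letter
        self.distance = distance  # gap to the previous interesting letter
        self.sum = 0              # total of all distances in the left subtree
        self.left = self.right = self.parent = None


def build(nodes, lo=0, hi=None):
    """Build a balanced tree over nodes[lo:hi]; return (root, subtree total)."""
    if hi is None:
        hi = len(nodes)
    if lo >= hi:
        return None, 0
    mid = (lo + hi) // 2
    root = nodes[mid]
    root.left, left_total = build(nodes, lo, mid)
    root.right, right_total = build(nodes, mid + 1, hi)
    for child in (root.left, root.right):
        if child is not None:
            child.parent = root
    root.sum = left_total
    return root, left_total + root.distance + right_total


def add_before(node, n):
    """Record that n characters were inserted just before this node's letter."""
    node.distance += n
    while node.parent is not None:
        if node.parent.left is node:   # we sit in the parent's left subtree
            node.parent.sum += n
        node = node.parent


def index_of(node):
    """Current index of this node's letter in the string."""
    index = node.sum + node.distance
    while node.parent is not None:
        if node.parent.right is node:  # parent and its left subtree precede us
            index += node.parent.sum + node.parent.distance
        node = node.parent
    return index


# Sentinel, then q in 'quick' and f in 'fox' from "+The quick brown fox".
nodes = [Node("+", 0), Node("q", 5), Node("f", 12)]
sentinel, q, f = nodes
build(nodes)
print(index_of(q), index_of(f))  # 5 17
add_before(q, 5)                 # "very " typed before 'quick'
add_before(f, 5)                 # "dark " typed before 'brown'
print(index_of(q), index_of(f))  # 10 27
```

Both add_before and index_of only walk from a node up to the root, so on a balanced tree each costs time logarithmic in the number of tracked letters, regardless of how long the string itself is.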