Background in case you care, if not skip it:
I was recording some audio today for a project, doing it a paragraph at a time. If I messed up a paragraph, I redid it until I got it right, and then moved on to the next paragraph. When I loaded the recordings onto the computer, I needed to find the last recording for each paragraph. Without knowing how many recordings I made for any particular paragraph, how do I go about this? (Don’t you love it when algorithms sneak into your daily life?)
In algorithmic terms, you have an array of elements, where each element is followed either by another element of the same type or by an element of a completely different type. Find the last element of each run (the audio clip that was recorded correctly).
The problem:
You have an array of objects, each with an id field, and objects with the same id sit next to each other. I want the last object for each id, for example in an array of ids like this:
aabbbbbccddddddddddddddeefffffffffggghhhhiiiijjklmnnnnoo
Obviously, if the length of the string is n and there are n distinct elements, it will take you n steps to figure it out. I’m more interested in the general algorithm. I could do it with a binary-search-style algorithm, but I don’t know its runtime when nothing is known about the input except the total number of elements.
Also, would knowing the number of distinct ids change the runtime of the algorithm? This is an interesting problem to me, and I’m asking only to satisfy my intellectual curiosity.
You should be able to look at the first id, and do a binary search for where that id ends. This can be done in O(log n) time.
You then step forward to the next element, and redo the binary search for where that id-sequence ends.
This yields an algorithm of complexity O(m × log n) where n is the number of elements and m the number of distinct elements.
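For what it's worth, here is a minimal Python sketch of the steps above. The function name `last_of_each_run` is made up for illustration, and it assumes that equal ids are adjacent and that the input supports `len` and indexing.

```python
def last_of_each_run(ids):
    """Return the index of the last element of each run of equal ids.

    Assumes equal ids are contiguous (for example, the list is sorted by id).
    """
    n = len(ids)
    result = []
    start = 0
    while start < n:
        current = ids[start]
        # Invariant: ids[lo] == current, and the run's last index is in [lo, hi].
        lo, hi = start, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2  # round up so the range always shrinks
            if ids[mid] == current:
                lo = mid        # run reaches at least as far as mid
            else:
                hi = mid - 1    # run ended before mid
        result.append(lo)
        start = lo + 1          # step forward to the next id
    return result


ids = "aabbbbbccddddddddddddddeefffffffffggghhhhiiiijjklmnnnnoo"
print("".join(ids[i] for i in last_of_each_run(ids)))  # prints "abcdefghijklmno"
```

Each pass of the outer loop is one binary search over at most n elements, and there is one pass per distinct id, which is where the m × log n comes from.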
Assuming n / m (the average number of elements for a specific id) is greater than log n, you get a sub-linear algorithm.
If n / m is less than log n, you are better off searching for the end of each id-sequence linearly.
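If you happen to know m up front, that comparison can be turned into a rough rule of thumb. The sketch below is only illustrative: both function names are hypothetical, and using a base-2 logarithm is my assumption (the base only changes the cutoff by a constant factor).

```python
import math


def last_of_each_run_linear(ids):
    """Linear scan: index i ends a run if it is the last index or the next id differs."""
    n = len(ids)
    return [i for i in range(n) if i + 1 == n or ids[i + 1] != ids[i]]


def prefer_binary_search(n, m):
    """Rule of thumb from the comparison above: binary search per id
    pays off when the average run length n / m exceeds log2(n)."""
    if n <= 1 or m <= 0:
        return False
    return n / m > math.log2(n)


print(prefer_binary_search(1024, 4))    # True: average run of 256 vs. log2(1024) = 10
print(prefer_binary_search(1024, 512))  # False: average run of 2 vs. 10
```

So knowing the number of distinct ids does not change the worst case, but it does tell you which of the two strategies to pick.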
(Note that this whole analysis depends on the list being sorted on the IDs, or at least grouped so that equal IDs are adjacent. Sorting typically takes time proportional to n × log n, so if you need to sort them first, you may just as well go with a linear algorithm 🙂)