In case you are given:
- certain amount of data
- memory with size half of the data size
- part of the data is sorted
- you do not know the size of the sorted data.
Which sorting algorithm would you choose?
I am debating between insertion and quicksort. I know that the best case for insertion sort is O(n), but the worst case is O(n2). Also, considering the fact the memory is limited, I would divide the data in two parts, and on each of them do quicksort, then merge everything together. It will take O(n) time to split the data, O(n) to merge the data, and O(n log n) to sort the data using quicksort, for a net runtime of O(n log n).
Does anyone have any suggestions on how to improve this?
Your mergesort-like approach seems very reasonable. More generally, this type of sorting algorithm is called an external sorting algorithm. These algorithms often work as you’ve described – load some subset of the data into memory, sort it, then write it back out to disk. At the end, use a merging algorithm to merge everything back together. The choice of how much to load in and what sorting algorithm to use are usually the dominant concerns. I’ll focus mostly on the sorting algorithm choice.
Your concerns about quicksort’s worst-case behavior are generally speaking nothing to worry about, since if you choose the pivot randomly the probability that you get a really bad runtime is low. The random pivot strategy also works well even if the data is already sorted, as it has no worst-case inputs (unless someone knows your random number generator and the seed). You could also use a quicksort variant like introsort, which doesn’t have the worst-case behavior, as your sorting algorithm in order to avoid this worst-case.
That said, since you know that the data is already partially sorted, you may want to look into an adaptive sorting algorithm for your sorting step. You’ve mentioned insertion sort for this, but there are much better adaptive algorithms out there. If memory is scarce (as you’ve described), you might want to try looking into the smoothsort algorithm, which has best-case runtime O(n), worst-case runtime O(n log n), and uses only O(1) memory. It’s not as adaptive as some other algorithms (like Python’s timsort, natural mergesort, or Cartesian tree sort), but it has lower memory usage. It’s also not as fast as a good quicksort, but if the data really is mostly sorted it can do pretty well.
Hope this helps!