I am not looking for an algorithm to the above question. I just want someone to comment on my answer.
I was asked the following question in an interview:
How to get top 100 numbers out of a large set of numbers (can’t fit in
memory)
And this is what I said:
Divide the numbers in batches of 1000 each. Sort each batch in “O(1)” time. Total time taken is O(n) up till now. Now take 1st 100 numbers from 1st and 2nd batch (in O(1)). Take 1st 100 from the above computed nos and the 3rd batch and so on. This will take O(n) in total – so it is an O(n) algorithm.
The interviewer replies that sorting a batch of 1000 nos. won’t take O(1) time and so won’t picking out 1st 100 out of a batch and after a lot of discussion he said, he doesn’t have problem with the algo taking O(n) time, he just has a problem with me saying that sorting the batch takes O(1) time.
My explanation was that 1000 doesn’t depend on the input (n). Irrespective of what n is, I’ll always make batches of 1000 nos. and if you have to calculate, the sorting takes O(1000*log 1000)) which is essentially O(1).
If you have to make proper calculations, it would be
1000*log 1000 to sort one batch
sort (n/1000) such batches
takes 1000 * log 1000 * n/1000 = O(n*log(1000)) time = O(n) time
I asked a lot of my friends also about this and although they agreed with me but partially.
So I wan’t to know if my reasoning is 100% accurate (please criticize even if it is 99% correct).
Just remember, this post is not asking for the answer to the above posted question. I have already found a better answer at Retrieving the top 100 numbers from one hundred million of numbers
The interviewer is wrong, but it’s useful to consider why. What you’re saying is correct, but there is an unstated assumption that you depend on. Possibly, the interviewer is making a different assumption.
If we say that sorting 1000 numbers is O(1), we’re being a bit informal. Specifically, what we mean is that, in the limit as N goes to infinity, there is a constant greater than or equal to the cost of sorting the 1000 numbers. Since the cost of sorting the fixed-size set is independent of N, the limit isn’t going to depend on N, either. Thus, it’s O(1) as N goes to infinity.
A generous interpretation is that the interviewer wanted you to treat the sorting step differently. You could be more precise and say that it was O(M*log(M)) as M goes to infinity (or M goes to N, if you prefer), with M representing the size of the batches of numbers. That would make an overall O(N*log(M)) for your approach, as N and M both approach infinity. Of course, that wasn’t the limit you described.
Strictly speaking, it’s meaningless to say that something is O(1) without specifying the limit. One usually doesn’t need to bother for algorithms, because it’s clear from the context: the limit commonly taken is as a single parameter approaches infinity. Your description is correct when considering only N, but you could consider more than just N.