I have a file that has 1 million numbers. I need to know how I can sort it efficiently, so that it doesn’t stall the computer, and it prints ONLY the top 10.
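#!/usr/bin/python3
# Find the 10 largest integers
# Don't store the whole list
import sys

def fOpen(fname):
    try:
        fd = open(fname, "r")
    except OSError:
        print("Couldn't open file.")
        sys.exit(1)
    # Convert each line to an int so comparisons are numeric, not lexicographic
    nums = [int(line) for line in fd.read().splitlines()]
    fd.close()
    return nums

words = fOpen(sys.argv[1])
g = len(words)
count = 10
for i in range(0, g - 1):
    # Selection sort step: find the largest remaining value and move it to position i
    pos = i
    for j in range(i + 1, g):
        if words[j] > words[pos]:
            pos = j
    if pos != i:
        words[i], words[pos] = words[pos], words[i]
    count -= 1
    if count == 0:
        break  # the first 10 positions now hold the 10 largest values
print(words[0:10])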
I know that this is selection sort, but I’m not sure which sort would be best here.
If you only need the top 10 values, then you’d waste a lot of time sorting every single number.
Just go through the list of numbers and keep track of the 10 largest values seen so far. Update those ten as you go through the list, and print them out when you reach the end.
This means you only need to make a single pass through the file (i.e., a time complexity of Θ(n)).
A simpler problem
You can look at your problem as a generalization of finding the maximum value in a list of numbers. If you’re given
{2, 32, 33, 55, 13, ...} and are asked to find the largest value, what would you do? The typical solution is to go through the list while remembering the largest number encountered so far, comparing it with each new number in turn. For simplicity, let’s assume we’re dealing with positive numbers.
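As a rough sketch of that idea (the function name `find_max` is made up for illustration, and starting from 0 is only safe under the positive-numbers assumption above):

```python
def find_max(numbers):
    """Return the largest value in a sequence of positive numbers."""
    largest = 0  # safe starting point only because all inputs are assumed positive
    for n in numbers:
        if n > largest:
            largest = n  # remember the new largest value seen so far
    return largest

print(find_max([2, 32, 33, 55, 13]))  # prints 55
```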
So you see, we can find the max in a single traversal of the list, as opposed to any kind of comparison sort.
Generalizing
Finding the top 10 values in a list is very similar. The only difference is that we need to keep track of the top 10 instead of just the max (top 1).
The bottom line is that you need some container that holds 10 values. As you’re iterating through your giant list of numbers, the only value you care about in your size-10-container is the minimum. That’s because this is the number that would be replaced if you’ve discovered a new number that deserves to be in the top-10-so-far.
Anyway, it turns out that the data structure best suited to finding the minimum quickly is a min heap. But I’m not sure if you’ve learned about heaps yet, and the overhead of using a heap for 10 elements could possibly outweigh its benefits.
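If you do want to try the heap route, here is a minimal sketch using Python's standard `heapq` module. The function name `top_k_heap` and the parameter `k` are just illustrative choices, not anything from your code:

```python
import heapq

def top_k_heap(numbers, k=10):
    """Return the k largest values, largest first, using a size-k min heap."""
    heap = []  # heap[0] is always the smallest of the values kept so far
    for n in numbers:
        if len(heap) < k:
            heapq.heappush(heap, n)      # still filling the container
        elif n > heap[0]:
            heapq.heapreplace(heap, n)   # drop the current min and add n
    return sorted(heap, reverse=True)

print(top_k_heap([5, 1, 9, 3, 7, 8, 2], k=3))  # prints [9, 8, 7]
```

The standard library also offers `heapq.nlargest(10, numbers)`, which gives the same result in one call.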
Any container that holds 10 elements and can obtain the min in a reasonable amount of time would be a good start.
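For instance, a plain list works fine at this size. The sketch below assumes the file has one integer per line (I'm guessing at your file format), and `top_k_from_file` is a hypothetical name. It reads the file line by line, so the whole list is never held in memory:

```python
def top_k_from_file(path, k=10):
    """Return the k largest integers in a file with one integer per line."""
    top = []  # never holds more than k values
    with open(path) as f:
        for line in f:  # iterates one line at a time, not the whole file
            line = line.strip()
            if not line:
                continue  # skip blank lines
            n = int(line)
            if len(top) < k:
                top.append(n)
            else:
                smallest = min(top)  # O(k) scan, cheap for k = 10
                if n > smallest:
                    top[top.index(smallest)] = n  # replace the current min
    return sorted(top, reverse=True)

# Example call, assuming the file name is passed on the command line:
# import sys; print(top_k_from_file(sys.argv[1]))
```

Each number costs at most a scan over 10 values, so the total work is still linear in the size of the file.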