There are 30 files, each containing about 100,000 data items. Each data item looks like this:
key->count, for example abcdefg->100, which means the count for the key 'abcdefg' is 100. A key can appear at most once in any one file, but it may also appear in other files.
How should I get the 10 keys whose total count, summed across all 30 files, is in the top 10?
Any help would be greatly appreciated.
I am assuming you want the 10 keys with the maximal total count [which seems to be the case according to your first comment].
Design Guidelines:
system are ~11.5 MB], and assuming the key is not too large (see note 1), the entire data set might be populated into memory.
Algorithm:
Build a histogram as a HashMap: key->int, populated while you read the files. For each key you read: if it is already in the histogram, add the count to the existing value; if it is not, add the (key, count) pair to the histogram. [O(n) average run time]
Create a min-heap and iterate over the histogram; the heap should always contain the top 10 values seen so far, along with their matching keys. There is an explanation of how to do it in this thread. For a constant top 10, this step is O(n) as well.
Advantages:
this will probably be the bottleneck – so minimizing the disk
reads/writes should be a priority.
O(n) average run time.
Disadvantage:
1: If the assumption does not hold, this can be partially solved by hashing the keys and storing only the hashes. When a hash collision occurs, check the keys for equality on the disk itself. This increases the number of reads, but with a good hash function the number of collisions should be relatively low. You should also load into memory the keys whose hashes collided [again, to avoid multiple disk reads], and only those; there will be far fewer of them than the total number of elements.
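
The two steps of the algorithm (histogram first, then a min-heap of size 10) could be sketched roughly as follows in Python. This is only an illustration under some assumptions: each line of a file is taken to be a single `key->count` item, and the function name `top_keys` and its parameters are made up for this example.

```python
import heapq
from collections import defaultdict


def top_keys(paths, k=10):
    """Return the k (key, total_count) pairs with the largest totals."""
    # Step 1: histogram mapping each key to its count summed over all files.
    totals = defaultdict(int)
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                # Assumed line format: key->count (split on the last "->").
                key, _, count = line.rpartition("->")
                totals[key] += int(count)

    # Step 2: min-heap of (total, key) that never grows beyond k entries,
    # so its root is always the smallest of the current top k.
    heap = []
    for key, total in totals.items():
        if len(heap) < k:
            heapq.heappush(heap, (total, key))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, key))

    # Report the survivors from largest to smallest total.
    return sorted(((key, total) for total, key in heap),
                  key=lambda pair: pair[1], reverse=True)
```

Each file is read exactly once, and the heap never holds more than `k` entries, so the second pass stays linear in the number of distinct keys for a constant `k`. In Python, `heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])` would probably do the same job as step 2 in one call.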