There is 30 files, any one contains about 100,000 data items, the data item

Asked: June 2, 2026 (2026-06-02T18:41:24+00:00)

There are 30 files, each containing about 100,000 data items. Each data item looks like this:
key->count, for example abcdefg->100, which means the key 'abcdefg' has a count value of 100. A key can appear at most once within a single file, but it may also appear in other files.

How should I find the 10 keys whose total count, summed across all 30 files, is the highest?

Any help would be greatly appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-02T18:41:25+00:00

I am assuming you want the 10 keys with the maximal total count [which seems to be true according to your first comment].

Design Guidelines:

Since the data is not too large [100,000 * 30 integers at 32 bits each is ~11.5 MB], and assuming the keys are not too large¹, the entire data set can be loaded into memory.
Once the data is in memory, everything you do with it is much faster, since disk I/O is far slower than RAM. Sorting the data on disk and reading it multiple times is therefore expected to be much slower than manipulating it in memory.
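
The size estimate above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are made up for illustration, and it counts the raw 32-bit count values only. Keys and hash table overhead come on top of this, and in a language like Python the real footprint will be considerably larger.

```python
FILES = 30
ITEMS_PER_FILE = 100_000
BYTES_PER_COUNT = 4  # assumption: each count is stored as a 32-bit integer


def counts_size_mib(files=FILES, items=ITEMS_PER_FILE, width=BYTES_PER_COUNT):
    """Raw size of all count values in MiB (keys and overhead excluded)."""
    return files * items * width / (1024 * 1024)


print(f"{counts_size_mib():.2f} MiB")  # prints "11.44 MiB"
```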

Algorithm :

Create a histogram, which will actually be a HashMap:key->int, populated while you read the files. For each key you read: if it is already in the histogram, add the count to the existing value; if it is not, add the (key,count) pair to the histogram. [O(n) average run time]
Once the histogram is populated, finding the top 10 is easy: create a min heap and iterate over the histogram, keeping the heap at 10 entries so that it always contains the top 10 values seen so far, along with their matching keys.
There is an explanation of how to do it in this thread. For a constant top 10, this step is O(n) as well.
When you are done, the heap contains your solution; just show its content.
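
A minimal Python sketch of the two steps above, for illustration only. It assumes each line of a file has the form key->count with an integer count, and that the files are UTF-8 text; the function names (add_lines, build_histogram, top_k) are invented here and are not part of the original answer.

```python
import heapq
from collections import defaultdict


def add_lines(histogram, lines):
    """Add the counts from an iterable of 'key->count' lines to the histogram."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last "->" so keys that contain "->" still parse.
        key, _, count = line.rpartition("->")
        histogram[key] += int(count)


def build_histogram(paths):
    """Read every file once and sum the counts per key."""
    histogram = defaultdict(int)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            add_lines(histogram, f)
    return histogram


def top_k(histogram, k=10):
    """Return the k (key, total) pairs with the largest totals, largest first."""
    heap = []  # min heap of (total, key); heap[0] is the smallest entry kept
    for key, total in histogram.items():
        if len(heap) < k:
            heapq.heappush(heap, (total, key))
        elif total > heap[0][0]:
            # The new total beats the smallest kept one, so swap it in.
            heapq.heapreplace(heap, (total, key))
    return [(key, total) for total, key in sorted(heap, reverse=True)]
```

With a list of file paths, `top_k(build_histogram(paths))` gives the 10 keys and their totals. Python's standard library also offers `heapq.nlargest(10, histogram.items(), key=lambda kv: kv[1])`, which does the selection step in one call.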

Advantages:

Only one pass over the data on disk: since the disk is much slower than RAM, it will probably be the bottleneck, so minimizing disk reads/writes should be a priority.
O(n) average run time.

Disadvantage:

If you have a very poor hash function [unlikely], the hash table might cause the solution to degrade to quadratic time complexity.
More work is needed if the keys are large and do not fit in memory; see footnote (1) for how this can be solved.

1: If the assumption does not hold, the problem can be partially solved by hashing the keys and storing only the hashes. When a hash collision occurs, check for equality on the disk itself. This increases the number of reads, but with a good hash function the number of collisions should be relatively low. Also, you should load into memory the keys whose hashes collided [again, to avoid multiple disk reads], and only them; there will be far fewer of them than the total number of elements.
