
The Archive Base


Editorial Team
Asked: June 2, 2026 (2026-06-02T18:41:24+00:00)


There are 30 files, each containing about 100,000 data items. A data item looks like this:
key->count, for example abcdefg->100, which means that the count value of the key 'abcdefg' is 100. A key can appear only once in a given file, but it may also appear in other files.

How should I get the 10 keys whose total count, summed over all 30 files, is in the top 10?

Any help would be greatly appreciated.


1 Answer

  1. Editorial Team
    Added an answer on June 2, 2026 at 6:41 pm

    I am assuming you want the 10 keys with the largest total count [which seems to be the case according to your first comment].

    Design Guidelines:

    • Since the data is not too large [100,000 * 30 counts stored as 32-bit
      integers take ~11.5 MB], and assuming the keys are not too
      large (1), the entire data set can be loaded into
      memory.
    • Once the data is in memory, everything you do with it is much faster, since disk I/O is far slower than RAM. Sorting the data on disk and reading it multiple times is therefore expected to be much slower than manipulating it in memory.
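
The size estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: like the answer, it counts just the 4-byte integer counts, so the keys and the hash-table overhead come on top and depend on the language and the key lengths. The constant names are illustrative.

```python
FILES = 30
ITEMS_PER_FILE = 100_000
BYTES_PER_COUNT = 4  # assumption: one 32-bit integer per count

total_bytes = FILES * ITEMS_PER_FILE * BYTES_PER_COUNT
total_mib = total_bytes / 2**20

print(f"{total_bytes:,} bytes = {total_mib:.1f} MiB")  # 12,000,000 bytes = 11.4 MiB
```

That is roughly 11.4 MiB, in line with the figure quoted above and small enough to hold in RAM on any ordinary machine.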

    Algorithm:

    1. Create a histogram, which will actually be a hash map (key->int), and
      populate it while you are reading the files. For each key you read: if it is already in the histogram, add its count to the existing value; if it is not, add the (key, count) pair to the histogram. [O(n) average run time]
    2. Once the histogram is populated, finding the top 10 is easy:
      create a min-heap and iterate over the histogram. The heap should always
      contain the 10 largest totals seen so far, together with the matching keys.
      There is an explanation of how to do it in this thread. For a constant top 10, this is O(n) as well.
    3. When you are done, the heap contains your solution; just output its contents.
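
The three steps above can be sketched in Python as follows. This is a minimal sketch rather than the answerer's own code. It assumes every non-empty line has the `key->count` form shown in the question, and the function names `build_histogram` and `top_k` are hypothetical.

```python
import heapq
from collections import defaultdict


def build_histogram(paths):
    """Step 1: sum the counts of each key across all files (one pass over the disk)."""
    totals = defaultdict(int)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Assumed line format: key->count (split on the last "->").
                key, _, count = line.rpartition("->")
                totals[key] += int(count)
    return totals


def top_k(totals, k=10):
    """Step 2: keep a min-heap holding at most k (total, key) pairs."""
    heap = []
    for key, total in totals.items():
        if len(heap) < k:
            heapq.heappush(heap, (total, key))
        elif total > heap[0][0]:
            # The new total beats the smallest of the current top k.
            heapq.heapreplace(heap, (total, key))
    # Step 3: the heap holds the answer; sort it largest-first for display.
    return [(key, total) for total, key in sorted(heap, reverse=True)]
```

Calling `top_k(build_histogram(paths))` with the list of 30 file paths returns the ten `(key, total)` pairs. The standard library's `heapq.nlargest(10, totals.items(), key=lambda kv: kv[1])` performs the same selection in a single call.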

    Advantages:

    • Only one pass over the disk. Since the disk is much slower than RAM,
      it will probably be the bottleneck, so minimizing disk
      reads/writes should be a priority.
    • O(n) average run time.

    Disadvantages:

    • If you have a very poor hash function [unlikely], the hash table degrades and the solution might decay to quadratic time complexity.
    • More work is needed if the keys are large and do not fit in memory; see footnote (1) for how this can be handled.

    1: If the assumption does not hold, this can be partially solved by hashing the keys and storing only the hashes. When a hash collision occurs, check the keys for equality on disk. This increases the number of reads, but with a good hash function the number of collisions should be relatively low. You should also load into memory the keys whose hashes collided, and only those [again, to avoid multiple disk reads]; there will be far fewer of them than the total number of elements.
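
One possible reading of this footnote, sketched in Python. This is an interpretation, not the answerer's code. It assumes the `key->count` line format from the question and non-negative counts, so that a key's total can never exceed the total of its hash bucket. The helper names `read_items`, `short_hash` and `top_k_hashed`, and the choice of an 8-byte BLAKE2b digest, are illustrative assumptions.

```python
import hashlib
import heapq
from collections import defaultdict


def read_items(paths):
    """Yield (key, count) pairs; assumes one key->count entry per line."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    key, _, count = line.strip().rpartition("->")
                    yield key, int(count)


def short_hash(key):
    """8-byte digest standing in for a long key (hypothetical choice)."""
    return hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()


def top_k_hashed(paths, k=10, hash_fn=short_hash):
    # Pass 1: keep totals per hash only; colliding keys share one bucket.
    buckets = defaultdict(int)
    for key, count in read_items(paths):
        buckets[hash_fn(key)] += count
    order = sorted(buckets, key=buckets.get, reverse=True)

    take = k
    while True:
        # Re-read the files, loading real keys only for the leading buckets.
        wanted = set(order[:take])
        exact = defaultdict(int)
        for key, count in read_items(paths):
            if hash_fn(key) in wanted:
                exact[key] += count
        best = heapq.nlargest(k, exact.items(), key=lambda kv: kv[1])
        # With non-negative counts a key never exceeds its bucket total, so
        # the result is final once the next unexamined bucket cannot beat
        # the k-th exact total.
        done = len(best) == k and take < len(order) and buckets[order[take]] <= best[-1][1]
        if take >= len(order) or done:
            return best
        take *= 2
```

Only the fixed-size hashes and the keys of the leading buckets are held in memory, at the price of at least one extra pass over the files. With a good hash function the loop is expected to finish after the first re-read.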

