The Archive Base

Editorial Team
Asked: June 8, 2026

There is a quite large file (>10G) on the disk. Each line in the file is composed of a line number and a person’s name, like this:

1 Jane
2 Perk
3 Sime
4 Perk
.. ..

I have to read this large file, find the frequency of each name, and finally output the results in descending order of frequency, like this:

Perk 2
Jane 1
Sime 1

As the interviewer requested, the job should be done as efficiently as possible, and multithreading is allowed. My solution is something like this:

  1. Because the file is too large, I partition it into several small files of about 100M each; via lseek I can locate the beginning and the end of each small file (beg, end);

  2. For these small files, there is a shared hash-map that uses a person’s name as the key and the number of times it has appeared so far as the value;

  3. For each small file, a single thread goes through it; every time the thread encounters a person’s name, it increments the corresponding value in the shared hash-map;

  4. When all threads finish, I think it’s time to sort the hash-map by the value field.

But because there might be too many names in that file, the sorting would be slow. I couldn’t come up with a good way to output the names in descending order.

I hope someone can help me with the above problem and suggest a better solution for both the multithreading and the sorting.

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 5:26 am

    Using a map-reduce approach could be a good idea for your problem. That approach would consist of two steps:

    1. Map: read chunks of data from the file and create a thread to process that data
    2. Reduce: the main thread waits for all other threads to finish and then it combines the results from each individual thread.
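The two steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the names `chunk_ranges`, `count_chunk` and `count_names` are made up for this example, each line is assumed to look like `<line-number> <name>`, and a line is assigned to the chunk in which its first byte lies, so that no line is counted twice or missed. Note also that in CPython, threads will not speed up CPU-bound counting because of the global interpreter lock; the thread pool here only mirrors the structure of the idea.

```python
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(path, n_chunks):
    """Split the file into byte ranges (beg, end) of roughly equal size."""
    size = os.path.getsize(path)
    step = max(1, size // n_chunks)
    return [(beg, min(beg + step, size)) for beg in range(0, size, step)]


def count_chunk(path, beg, end):
    """Map step: count names on the lines that start inside [beg, end)."""
    counts = Counter()
    with open(path, "rb") as f:
        if beg > 0:
            # Back up one byte and discard through the next newline: a line
            # starting exactly at beg is kept, a partial line is skipped
            # (it belongs to the previous chunk).
            f.seek(beg - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            fields = line.split(None, 1)  # b"<line-number> <name>"
            if len(fields) == 2:
                counts[fields[1].strip().decode()] += 1
    return counts


def count_names(path, n_chunks=4):
    """Reduce step: the main thread merges the per-thread counters."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = pool.map(lambda r: count_chunk(path, *r),
                            chunk_ranges(path, n_chunks))
        for partial in partials:
            total.update(partial)
    return total
```

Each worker owns its own `Counter`, so no lock is needed until the merge, which happens in a single thread. Swapping in a process pool would be one way to get real parallelism in Python, at the cost of sending the partial counters between processes.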

    The advantage of this solution is that you would not need locking between the threads, since each one of them would operate on a different chunk of data and keep its own partial result. Using a shared data structure, as you are proposing, could be a solution too, but you may incur some overhead from lock contention.
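For comparison, here is a minimal sketch of the shared-structure variant, assuming a single lock guards the shared map. The function name `count_into_shared` and the `batch_size` parameter are hypothetical. Each thread accumulates a small local batch and takes the lock only when flushing it, which is one common way to reduce contention compared with locking on every name.

```python
import threading
from collections import Counter


def count_into_shared(names, shared, lock, batch_size=1000):
    """Count names into a shared Counter, locking once per batch."""
    local = Counter()
    for i, name in enumerate(names, 1):
        local[name] += 1
        if i % batch_size == 0:
            with lock:  # the only point where threads contend
                shared.update(local)
            local.clear()
    with lock:  # flush whatever is left in the last partial batch
        shared.update(local)
```

With `batch_size=1` this degenerates into locking on every name, which is the case where the contention overhead would be most visible.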

    You need to do the sorting part at the reduce step, when the data from all the threads is available. But you might want to do some work during the map step, so that it is easier (quicker) to finish the complete sort at the reduce step.
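As a small illustration of the final sort, assuming the merged counts are available as a dictionary: the sort runs over the distinct names only, not over the lines of the file, so its cost depends on how many different names there are. The helper name `by_frequency` is made up for this sketch, and ties are broken alphabetically, which is an arbitrary choice.

```python
def by_frequency(counts):
    """Return (name, frequency) pairs, most frequent first."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


for name, freq in by_frequency({"Jane": 1, "Perk": 2, "Sime": 1}):
    print(name, freq)
# Perk 2
# Jane 1
# Sime 1
```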

    If you prefer to avoid the sequential sorting at the end, you could use some custom data structure. I would use a map (something like a red-black tree or a hash table) for quickly finding a name. Moreover, I would use a heap to keep the names ordered by frequency. Of course, you would need to have parallel versions of those data structures. Depending on how coarse the parallelization is, you may or may not run into lock contention.
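A simplified, single-threaded sketch of the map-plus-heap idea is shown below. It is not a parallel data structure: it assumes the counts have already been collected in a map, then builds a heap from them and pops names in descending order of frequency. The generator name `iter_by_frequency` is hypothetical.

```python
import heapq
from itertools import islice


def iter_by_frequency(counts):
    """Yield (name, frequency) pairs lazily, most frequent first."""
    # Negate the frequency because heapq implements a min-heap.
    heap = [(-freq, name) for name, freq in counts.items()]
    heapq.heapify(heap)  # linear time in the number of distinct names
    while heap:
        neg_freq, name = heapq.heappop(heap)
        yield name, -neg_freq


# Take only the two most frequent names without ordering the rest.
top_two = list(islice(iter_by_frequency({"Jane": 1, "Perk": 2, "Sime": 1}), 2))
```

Building the heap is linear and each pop is logarithmic, so this mainly pays off when only the first few results are needed; producing the full list this way costs about the same as an ordinary sort.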

© 2021 The Archive Base. All Rights Reserved