
The Archive Base

Editorial Team
Asked: June 7, 2026

Which is faster to process a 1TB file: a single machine or 5 networked machines?

Which is faster to process a 1TB file: a single machine or 5 networked
machines? (“To process” refers to finding the single UTF-16 character
with the most occurrences in that 1TB file). The rate of data
transfer is 1Gbit/sec, the entire 1TB file resides in 1 computer, and
each computer has a quad core CPU.

Below is my attempt at the question using an array of longs (with array size of 2^16) to keep track of the character count. This should fit into memory of a single machine, since 2^16 x 2^3 (size of long) = 2^19 = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency times cited by Jeff Dean, and I tried my best to use the best approximations that I knew of. The final answer is:

Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
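The counting approach described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the function name, the 1 MiB chunk size, and the choice to count 16-bit code units (rather than decoding surrogate pairs into code points) are all assumptions made for the example. Python integers do not overflow, so the 64-bit counter width discussed in the post does not need to be handled explicitly here.

```python
import io
import sys
from array import array

CHUNK_BYTES = 1 << 20  # assumed read size of 1 MiB per chunk


def most_frequent_utf16_unit(stream, byteorder="little"):
    """Return (code_unit, count) for the most frequent 16-bit unit in stream.

    Assumption: every 2 bytes form one UTF-16 code unit; surrogate pairs
    are counted as two separate units.
    """
    counts = [0] * (1 << 16)  # one counter per possible 16-bit value
    leftover = b""            # odd trailing byte carried into the next chunk
    while True:
        chunk = stream.read(CHUNK_BYTES)
        if not chunk:
            break
        data = leftover + chunk
        usable = len(data) - (len(data) % 2)
        units = array("H")    # unsigned 16-bit on common platforms
        assert units.itemsize == 2
        units.frombytes(data[:usable])
        if byteorder != sys.byteorder:
            units.byteswap()  # convert file byte order to native order
        for unit in units:
            counts[unit] += 1
        leftover = data[usable:]
    best = max(range(len(counts)), key=counts.__getitem__)
    return best, counts[best]


if __name__ == "__main__":
    sample = io.BytesIO("abracadabra".encode("utf-16-le"))
    unit, count = most_frequent_utf16_unit(sample)
    print(chr(unit), count)
```

The final `max` over 65,536 counters corresponds to step 1c below; the loop body corresponds to step 1b and runs while the next chunk is being read.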

1) Single Machine
 a) Time to Read File from Disk --> 5.8 hrs
   -If it takes 20ms to read 1MB seq from disk, 
    then to read 1TB from disk takes: 
    20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs 
    = 350 mins = 5.8 hrs 

 b) Time needed to fill array w/complete count data 
    --> 0 sec since it is computed while doing step 1a
    -At 0.5 MB, the count array fits into L2 cache. 
     Since L2 cache takes only 7 ns to access, 
     the CPU can read & write to the count array 
     while waiting for the disk read. 
     Time: 0 sec since it is computed while doing step 1a

 c) Iterate thru entire array to find max count --> 0.00625ms
   -Since it takes 0.0125ms to read & write 1MB from 
    L2 cache and array size is 0.5MB, then the time 
    to iterate through the array is: 
    0.0125ms/MB x 0.5MB = 0.00625ms  

 d) Total Time 
    Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)

2) 5 Networked Machines   
   a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
      1TB x 1024GB/TB x 8bits/B x 1s/Gbit 
      = 8,192s = 137m = 2.3hr
      But since the original machine keeps a fifth of the data, it
      only needs to send (4/5)ths of data, so the time required is: 
      2.3 hr x 4/5 = 1.84 hrs
      *But to send the data, the data needs to be read, which
       is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
       So total time = 1.84hrs + 4.64 hrs = 6.48 hrs

   b) Time to fill array w/count data from original machine --> 1.16 hrs
      -The original machine (that had the 1TB file) still needs to
       read the remainder of the data in order to fill the array with
       count data. So this requires (1/5)(answer 1a)=1.16 hrs.  
       The CPU time to read & write to the array is negligible, as 
       shown in 1b.      

   c) Time to fill other machine's array w/counts --> not counted   
      -As the file is being transferred, the count array can be 
       computed. This time is not counted. 

   d) Time required to receive 4 arrays --> (2^-6)s
      -Each count array is 0.5MB
       0.5MB x 4 arrays x 8bits/B x 1s/Gbit 
       = 2^20B/2 x 2^2 x 2^3 bits/B x 1s/2^30bits 
       = 2^25/2^31s = (2^-6)s 

   e) Time to merge arrays  
      --> 0 sec (since they can be merged while receiving)

   f) Total time 
      Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs 
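The arithmetic in both estimates can be reproduced with a short script. It follows the model used in the post exactly (20 ms per MB of sequential disk read, a 1 Gbit/s link, disk reads and network sends added serially rather than overlapped), so it checks the numbers, not whether the model is right. All constant and function names are made up for this sketch.

```python
DISK_MS_PER_MB = 20.0      # sequential disk read figure cited in the post
MB_PER_TB = 1024 * 1024    # 1 TB = 1024 GB = 1024 * 1024 MB
GBIT_PER_TB = 1024 * 8     # 1024 GB at 8 bits per byte
LINK_GBIT_PER_S = 1.0      # network rate given in the question
MACHINES = 5


def single_machine_hours():
    """Step 1a: read the whole file from one disk."""
    seconds = DISK_MS_PER_MB * MB_PER_TB / 1000.0
    return seconds / 3600.0


def networked_hours(machines=MACHINES):
    """Steps 2a + 2b, added serially as in the post."""
    remote_share = (machines - 1) / machines           # data sent to other machines
    disk_hours = single_machine_hours()
    send_hours = GBIT_PER_TB / LINK_GBIT_PER_S / 3600.0
    step_a = remote_share * send_hours + remote_share * disk_hours
    step_b = (1 - remote_share) * disk_hours            # local share still read from disk
    return step_a + step_b


if __name__ == "__main__":
    print(f"single machine: {single_machine_hours():.2f} h")
    print(f"5 networked machines: {networked_hours():.2f} h")
```

Without rounding the intermediate values, this gives about 5.83 h and 7.65 h, which matches the 5.8 h and 7.64 h figures above to within rounding.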
1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 2:42 pm

    This is not an answer, just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains about 550 G symbols (2^40 bytes at 2 bytes per symbol), and because nothing is said about their expected frequency, a single symbol could account for all of them, so you need a count array of at least 64-bit integers (that is, 8 bytes per element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes, or just 512 KiB, and not 4 GiB as you have miscalculated. It would take only ≈4.3 ms to send this data over a 1 Gbps link (protocol headers add roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but they are not widely supported). An array of this size also fits comfortably in the CPU cache.
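As a quick check of the figures in this paragraph, the following sketch recomputes the symbol count, the array size, and the transfer time. The 3% header overhead is the rough figure quoted above, and the link is taken as 10^9 bits per second; the function and constant names are invented for the example.

```python
FILE_BYTES = 1 << 40                  # 1 TiB
BYTES_PER_SYMBOL = 2                  # one UTF-16 code unit
SYMBOL_VALUES = 1 << 16               # distinct 16-bit values
COUNTER_BYTES = 8                     # 64-bit counters
LINK_BITS_PER_SECOND = 1_000_000_000  # 1 Gbps, assumed decimal
HEADER_OVERHEAD = 0.03                # rough TCP/IP over Ethernet overhead


def symbols_in_file():
    """Number of 16-bit symbols in the file."""
    return FILE_BYTES // BYTES_PER_SYMBOL


def count_array_bytes():
    """Size of the frequency array in bytes."""
    return SYMBOL_VALUES * COUNTER_BYTES


def transfer_ms(payload_bytes):
    """Time in milliseconds to send payload_bytes, including header overhead."""
    bits_on_wire = payload_bytes * 8 * (1 + HEADER_OVERHEAD)
    return bits_on_wire / LINK_BITS_PER_SECOND * 1000.0


if __name__ == "__main__":
    print(f"symbols: {symbols_in_file():,}")
    print(f"fits in 32-bit counter: {symbols_in_file() <= 2**32 - 1}")
    print(f"array size: {count_array_bytes() // 1024} KiB")
    print(f"transfer: {transfer_ms(count_array_bytes()):.2f} ms")
```

The worst-case count (every symbol identical) is 2^39, which exceeds the 32-bit maximum and is the reason 64-bit counters are required.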

    You have grossly overestimated the time it would take to process the data and extract the frequencies, and you have overlooked the fact that the processing can overlap the disk reads. Updating the frequency array, which resides in the CPU cache, is so fast that the computation time is negligible, as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive, so you would still need the full 5.8 hrs to read the data in the single-machine case.
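One way to picture the overlap is a reader thread that keeps fetching the next chunk while the main thread counts the previous one. The sketch below is only an illustration under stated assumptions: the function name, chunk size, and queue depth are hypothetical, the data is assumed to be in the machine's native byte order, and in practice the operating system's read-ahead already provides much of this overlap for sequential reads.

```python
import io
import queue
import sys
import threading
from collections import Counter

# Codec matching the machine's native byte order (used only to build test data).
NATIVE_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def count_with_overlapped_reads(stream, chunk_bytes=1 << 20, depth=4):
    """Count 16-bit units while a background thread reads ahead."""
    chunks = queue.Queue(maxsize=depth)  # bounded so the reader cannot run far ahead

    def reader():
        while True:
            chunk = stream.read(chunk_bytes)
            chunks.put(chunk)            # an empty chunk marks end of file
            if not chunk:
                return

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    counts = Counter()
    leftover = b""                       # odd trailing byte from the previous chunk
    while True:
        chunk = chunks.get()
        if not chunk:
            break
        data = leftover + chunk
        usable = len(data) - (len(data) % 2)
        # Reinterpret the bytes as native-order unsigned 16-bit values.
        counts.update(memoryview(data)[:usable].cast("H"))
        leftover = data[usable:]
    thread.join()
    return counts


if __name__ == "__main__":
    sample = io.BytesIO("aab".encode(NATIVE_UTF16))
    print(count_with_overlapped_reads(sample, chunk_bytes=3).most_common(1))
```

The total run time of such a pipeline is bounded below by the slower stage, which here is the disk read, consistent with the point that the single drive remains the bottleneck.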

    In fact, this is an example of the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write speed.
