Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6036717
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 23, 20262026-05-23T05:57:57+00:00 2026-05-23T05:57:57+00:00

I have a function that processes a large image. By the specification, the largest

  • 0

I have a function that processes a large image. By the specification, the largest this image can be is 55mb. The processing entails breaking the image into several different bands and then reconstituting the image by adding these bands back into an output image. Because the image is so large, I can’t keep all four images plus input and output images in memory simultaneously on a 32 bit system. As a result, I put each image on disk and then read it back in in portions.

Prior to multithreading, the pseudocode looks like:

for (y is 0 to ysize)
   unsigned short* ptr1 = ReadLineFromDisk(image1, y)
   unsigned short* ptr2 = ReadLineFromDisk(image2, y)
   unsigned short* ptr3 = ReadLineFromDisk(image3, y)
   unsigned short* ptr4 = ReadLineFromDisk(image4, y)
   unsigned short* outPtr = &(outImage[y*inXSize])
   for (x is 0 to xsize, ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr){
        outPtr = combination of ptr1, ptr2, ptr3, ptr4;
   }
}

This code runs in 3 seconds on a dual core machine with a standard 500 gb hard drive using a high performance counter.

If increase the number of lines read from disk to something like 100, and then step through that, with code that looks like:

chunksize = 100;
for (y is 0 to ysize by chunksize)
   unsigned short* ptr1 = ReadChunkFromDisk(image1, y)
   unsigned short* ptr2 = ReadChunkFromDisk(image2, y)
   unsigned short* ptr3 = ReadChunkFromDisk(image3, y)
   unsigned short* ptr4 = ReadChunkFromDisk(image4, y)
   unsigned short* outPtr = &(outImage[y*inXSize])
   for (x is 0 to xsize*chunk, ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr){
        outPtr = combination of ptr1, ptr2, ptr3, ptr4;
   }
}

This code is faster than the previous code, down to 1.5 seconds.

Question 1: Why is that code faster?

I hypothesize that it’s faster because, in my experience, large, contiguous reads are faster than smaller ones for the same amount of data. That is, if I read 100 lines of data all at once, that’s faster than 100 individual reads, at least for a regular (non-SSD) hard drive. Is my hypothesis close to correct?

Even so, the processor is not being used intensively here. Increasing the cache size is actually prohibitive, in that 1.5 is the best I can get, and then the value appears to drop off a bit (not sure why that would be either, except that maybe there’s some disk caching playing a role). That leads me to

Question 2: Why would there be a sweet spot in the chunk size?

If I understand things here (and I don’t think I do, really), if everything could be in memory, then that would be extremely quick, because there would be no disk hits. If reading more also makes things faster, wouldn’t reading in say a quarter of the image at a time be only a slight speed hit?

So then I switch to placing the outer loop in a lambda expression and using Intel’s TBB to thread the code, something like:

chunksize = 100;
parallel_for (y is 0 to ysize by chunksize in a lambda expression)
   unsigned short* ptr1 = ReadChunkFromDisk(image1, y)
   unsigned short* ptr2 = ReadChunkFromDisk(image2, y)
   unsigned short* ptr3 = ReadChunkFromDisk(image3, y)
   unsigned short* ptr4 = ReadChunkFromDisk(image4, y)
   unsigned short* outPtr = &(outImage[y*inXSize])
   for (x is 0 to xsize*chunk, ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr){
        outPtr = combination of ptr1, ptr2, ptr3, ptr4;
   }
}

This code ranges in speeds from 0.4 seconds to 1.6 seconds.

That brings me to:

Question 3: Shouldn’t that speed increase be, at most, 2x, not 4x?

This is a dual core machine I’m running these benchmarks on, so in a perfect world, one thread reads from disk while the other processes. Even when it runs at the 4x speed increase, it’s only using 80% of the processors, not 100%, so there is a disk bottleneck still left. But a 4x increase means that something else is happening as well.

I also assume that the wide range of speed differences is that the threads are not perfectly synchronized on their reads, if that’s how the speed increase is happening. The real, final question, is:

Question 4: How can I get that 4x speed increase consistently?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-23T05:57:58+00:00Added an answer on May 23, 2026 at 5:57 am

    Answer 1: Yes you’re disk-bound so the CPU will not be pegged all that much and yes reading larger chunks is more efficient (as long as the chunks are aligned with the disk cache).

    Answer 2: A disk that has an 8 MB cache and is spinning at 10k RPM might get a throughput of 60 to 80 MB/sec, so the “sweet spot” would be to read chunks aligned with the cache size. You can increase your buffer, but keep it aligned with the cache size: i.e. 8MB, 16MB, 32MB, etc.

    Answer 3: Ideally you would want to dedicate one thread to reading from disk and the other to processing the data (you may want to use several threads for processing). Multi-threading the disk reads may have some small performance increase, but it is generally not the case. I don’t know why you think “something else” is happening when you get a 4x increase.

    Answer 3 Update:
    Frankly, I don’t exactly know why this is happening either, but I’ve also seen it with multithreaded disk I/O on .NET applications. As a matter of fact I even have a C# test example which demonstrates the same kind of performance increase that you’re noticing. Note that in my test I’m loading HTML pages which are roughly what you would see in the “wild” (about 80 to 160 KB each), so I’m not aligning my reads with the disk cache. It may be possible that multiple threads reading at once are actually more efficient because you are taking advantage of the disk cache despite the fact that you’re doing multiple reads. Of course, this is just an off the cuff assumption that I have no evidence to back up yet so please take it with a grain of salt! I think that if your files are large enough and your disk reading thread actually has a buffer aligned with the disk cache, then adding more threads won’t improve your speed at all. If you still see an improvement in speed then do let us know!

    Answer 4:
    Try the following:

    1. Align your buffer with the cache size of your disk.
    2. Reduce the number of applications that are also trying to access disk at the same time.
    3. Load up as much of the image in memory as you can and run enough threads to fully utilize your CPU (you’re going to improvise for the number of threads, play around and see where is the “sweet spot”).
    4. Use only one disk reading thread and make sure that it’s constantly reading!!!

    And again, you’re disk-bound, so you may never really get 100% CPU utilization.

    Answer 4 Update:
    I don’t think that Intel’s TBB are actually the thing that’s causing the performance increase you (and I) are seeing… as I said, my best guess is that multiple threads may actually be more efficient if they’re providing a better utilization of the disk cache. I’m not even sure if that’s a correct assumption, so don’t quote me without testing!

    Reading:
    I found a very detailed dissertation, titled Asynchronous/Multi Threaded I/O on Commodity Systems with Multiple Disks – a performance study, that does some amazing analysis and testing of cases where multitreaded I/O outperforms single threaded I/O. Take a look around page 86.

    Dr. Dobbs also has an article on the subject, although I didn’t have a chance to read the whole thing, I just skimmed through it.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a function that looks like this class NSNode { function insertAfter(NSNode $node)
I have a function that automatically exports a table into a CSV file, then
I have a Python program that processes fairly large NumPy arrays (in the hundreds
I have a function that I want it execute alternating processes every time it's
I have a function that gives me the following warning: [DCC Warning] filename.pas(6939): W1035
I have a function that gets x(a value) and xs(a list) and removes all
I have a function that takes, amongst others, a parameter declared as int privateCount
I have a function that I use called sqlf(), it emulates prepared statements. For
I have a function that is effectively a replacement for print, and I want
I have a function that takes a struct, and I'm trying to store its

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.