I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.
Such image manipulations are very stressful on the CPU. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.
I have several requirements:
- Executable size must be minimized. In other words, I can’t use massive libraries. What’s the most light-weight, portable threading library for C/C++?
- Executable size must be minimized. I was thinking of having a function forEachRow(fp* ) which runs a thread for each row, or even a forEachPixel(fp* ) where fp operates on a single pixel in its own thread. Which is best?
- Should I use normal functions or functors or functionoids or some lambda functions or … something else?
- Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
- Would I need to lock my read-only and write-only arrays?
- The input is only read from, but many operations require input from more than one pixel in the array.
- The output is only written once per pixel.
- Speed is also important (of course), but optimizing executable size takes precedence.
Thanks.
More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks
If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.
You don’t just want to make a lot of threads – there’s a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a linear algorithm. The optimal number of threads is a function of the number of cpus/cores available, and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.
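As a rough illustration of the point about thread counts, here is a minimal sketch that caps the number of workers at the number of hardware threads and at the number of rows. The helper name pickThreadCount is made up for this example, and it assumes a C++11 compiler; it ignores time spent blocked on I/O, so treat the result as a starting point rather than the optimum.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical helper: choose how many worker threads to use for an image
// with `rows` rows, given the hardware thread count reported by the platform.
std::size_t pickThreadCount(std::size_t rows, unsigned hardwareThreads) {
    // std::thread::hardware_concurrency() may return 0 when the count is unknown.
    std::size_t hw = hardwareThreads == 0 ? 1 : hardwareThreads;
    // Never use more threads than there are rows, and always use at least one.
    return std::max<std::size_t>(1, std::min(hw, rows));
}

// Convenience overload that asks the standard library for the hardware count.
std::size_t pickThreadCount(std::size_t rows) {
    return pickThreadCount(rows, std::thread::hardware_concurrency());
}
```

With these assumptions, a 1080-row image on an 8-thread machine gets 8 workers, and a 3-row image gets 3.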
OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting a few #pragma omp directives into existing code, and letting the compiler handle creating threads and synchronization.
In general, as long as the data isn’t changing, you won’t have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won’t have to lock that either.
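To make the no-locking case concrete, here is a sketch of a per-row operation that reads several neighbouring input pixels and writes each output pixel exactly once. The function blurRows, the 3-tap horizontal box blur, and the 8-bit grayscale row-major layout are all assumptions chosen for illustration, not anything from the question's code. It assumes in.size() equals width * height.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Horizontal 3-tap box blur on a grayscale image stored row-major.
// `in` is only read; each element of `out` is written exactly once,
// so the threads need no locks.
void blurRows(const std::vector<std::uint8_t>& in,
              std::vector<std::uint8_t>& out,
              int width, int height) {
    out.resize(in.size());
    if (width <= 0 || height <= 0) return;
    // Rows are independent, so OpenMP may hand them to different threads.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* src = &in[static_cast<std::size_t>(y) * width];
        std::uint8_t* dst = &out[static_cast<std::size_t>(y) * width];
        for (int x = 0; x < width; ++x) {
            int left = x > 0 ? x - 1 : x;           // clamp at the left edge
            int right = x + 1 < width ? x + 1 : x;  // clamp at the right edge
            dst[x] = static_cast<std::uint8_t>((src[left] + src[x] + src[right]) / 3);
        }
    }
}
```

Because the inner loop walks one row in order, an optimization that carries state from the previous pixel still works inside a row; only the rows themselves are distributed across threads.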
For OpenMP, there’s no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here’s an image-processing example from Intel (converts rgb to grayscale):
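A minimal sketch of such a conversion looks like the following. It is not necessarily Intel's exact listing: the RGB struct, the function name toGrayscale, and the rounding are assumptions, and the weights are the common 0.299 / 0.587 / 0.114 luma coefficients.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pixel layout for this example.
struct RGB { std::uint8_t red, green, blue; };

// Convert an RGB image to 8-bit grayscale, one output value per pixel.
std::vector<std::uint8_t> toGrayscale(const std::vector<RGB>& rgb) {
    std::vector<std::uint8_t> gray(rgb.size());
    const long n = static_cast<long>(rgb.size());
    // Each iteration touches only its own pixel, so iterations can run in any
    // order on any thread. Without OpenMP enabled, the pragma is ignored.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        gray[i] = static_cast<std::uint8_t>(
            rgb[i].red * 0.299 + rgb[i].green * 0.587 + rgb[i].blue * 0.114 + 0.5);
    }
    return gray;
}
```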
This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.