I have a large set of files for which some heavy processing needs to be done.
This processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job) and takes a few minutes to run.
My current use case is to start a Hadoop job on the input data, but I’ve had this same problem in other cases before.
In order to fully utilize the available CPU power I want to be able to run several of these tasks in parallel.
However, a very simple example shell script like this will trash the system performance due to excessive load and swapping:
find . -type f | while IFS= read -r name
do
    some_heavy_processing_command "${name}" &
done
So what I want is essentially similar to what “gmake -j4” does.
I know bash supports the “wait” command, but that only waits until all child processes have completed. In the past I’ve created scripts that run a ‘ps’ command and then grep the child processes out by name (yes, I know … ugly).
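For reference, newer bash versions (4.3 and later, as far as I know) also accept “wait -n”, which returns as soon as any one background job exits. A minimal sketch of a throttled loop built on that, where heavy_job is a hypothetical stand-in for some_heavy_processing_command and the item names are made up:

```shell
#!/usr/bin/env bash
# Sketch: cap the number of background jobs at max_jobs using "wait -n".
# Assumes bash >= 4.3; heavy_job is a hypothetical stand-in for the real command.
max_jobs=4
outdir=$(mktemp -d)
heavy_job() { sleep 1; : > "$outdir/$1.done"; }

peak=0
for name in one two three four five six; do
  # If max_jobs are already running, block until any one of them exits.
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    wait -n
  done
  heavy_job "$name" &
  # Record the highest number of simultaneously running jobs seen so far.
  running=$(( $(jobs -rp | wc -l) ))
  if (( running > peak )); then peak=$running; fi
done
wait   # wait for the jobs that are still running

finished=$(( $(ls "$outdir" | wc -l) ))
echo "finished=$finished peak=$peak"
```

This keeps everything inside one bash script, at the cost of depending on a fairly recent bash.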
What is the simplest/cleanest/best solution to do what I want?
Edit: Thanks to Frederik: yes, indeed, this is a duplicate of “How to limit number of threads/sub-processes used in a function in bash”.
The “xargs --max-procs=4” works like a charm.
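A minimal sketch of how the loop from the question might look with xargs. It assumes an xargs that supports -0 and -P (the short form of --max-procs), as GNU and BSD xargs do; the temporary files and the echo command are hypothetical stand-ins for the real input and for some_heavy_processing_command:

```shell
#!/usr/bin/env bash
# Sketch: run at most 4 jobs at a time with xargs.
# The sample files and the "echo" job are placeholders for illustration only.
workdir=$(mktemp -d)
touch "$workdir/a.dat" "$workdir/b c.dat" "$workdir/d.dat"

# -print0 / -0 keep file names containing spaces or newlines intact.
# -n 1 passes one file per invocation; -P 4 caps the number of parallel jobs.
find "$workdir" -type f -print0 |
  xargs -0 -n 1 -P 4 sh -c 'echo "processed: $1"' _ > "$workdir.log"

# Count how many files were handled.
processed_count=$(grep -c '^processed: ' "$workdir.log")
echo "jobs finished: $processed_count"
```

In real use the sh -c wrapper can be dropped and the command named directly, e.g. find . -type f -print0 | xargs -0 -n 1 -P 4 some_heavy_processing_command.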
(So I voted to close my own question)
Having said that, Fredrik makes the excellent point that xargs does exactly what you want…