I am experiencing a weird problem with the Java ProcessBuilder. The code is shown below (in a slightly simplified form):
public class Whatever implements Runnable
{
    public void run() {
        // someIdentifier is a randomly generated string
        String in = someIdentifier + "input.txt";
        String out = someIdentifier + "output.txt";
        ProcessBuilder builder = new ProcessBuilder("./whatever.sh", in, out);
        try {
            Process process = builder.start();
            process.waitFor();
        } catch (IOException e) {
            log.error("Could not launch process. Command: " + builder.command(), e);
        } catch (InterruptedException ex) {
            log.error(ex);
        }
    }
}
whatever.sh reads:
R --slave --args $1 $2 <whatever1.R >> r.log
Loads of instances of Whatever are submitted to an ExecutorService of fixed size (35). The rest of the application waits for all of them to finish, which is implemented with a CountDownLatch. Everything runs fine for several hours (Scientific Linux 5.0, java version “1.6.0_24”) before throwing the following exception:
java.io.IOException: Cannot run program "./whatever.sh": java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(Unknown Source)
... rest of stack trace omitted...
Does anyone have an idea what this means? Based on the Google/Bing search results for java.io.IOException: error=11, it is not the most common of exceptions, and I am completely baffled.
My wild and not so educated guess is that I have too many threads trying to launch the same file at the same time. However, it takes hours of CPU time to reproduce the problem, so I have not tried with a smaller number.
Any suggestions are greatly appreciated.
The error=11 is almost certainly the EAGAIN error code; on Linux, EAGAIN is defined as 11.
The clone(2) system call documents an EAGAIN error return for the case where too many processes are already running.
The fork(2) system call documents two EAGAIN error returns: one when the kernel cannot allocate enough memory to set up the child, and one when the caller’s limit on the number of processes (RLIMIT_NPROC) has been reached.
If you were really that low on memory, it would almost certainly show in the system logs. Check dmesg(1) output or /var/log/syslog for any potential messages about low system memory. (Other things would break too, so this doesn’t seem too plausible.)
Much more likely is running into either the per-user limit on processes or the system-wide maximum number of processes. Perhaps one of your processes isn’t properly reaping zombies? This would be very easy to spot by checking ps(1) output over time. (Maybe check every minute or ten minutes if it really does take hours before you’re in trouble.)
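To make the ps(1) check concrete, here is a minimal Java sketch that counts zombie entries in ps output. It assumes a Linux procps-style ps that accepts `-eo stat=` (print only the state column, without a header); the class and method names are hypothetical and not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class ZombieCounter {

    // Counts lines whose process state starts with 'Z' (zombie/defunct).
    static int countZombies(List<String> statLines) {
        int zombies = 0;
        for (String line : statLines) {
            if (line.trim().startsWith("Z")) {
                zombies++;
            }
        }
        return zombies;
    }

    // Runs "ps -eo stat=" once and returns one state string per process.
    // Assumption: a procps-style ps is on the PATH.
    static List<String> readProcessStates() throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder("ps", "-eo", "stat=");
        builder.redirectErrorStream(true);
        Process process = builder.start();
        process.getOutputStream().close(); // ps reads nothing from stdin
        List<String> lines = new ArrayList<String>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        process.waitFor();
        return lines;
    }
}
```

Calling `ZombieCounter.countZombies(ZombieCounter.readProcessStates())` every minute or so and logging the result, along with the total number of lines, should show whether the process count creeps up over the hours before the failure.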
If you’re not reaping zombies, then read up on whatever you must do to ProcessBuilder to use waitpid(2) to reap your dead children.
If you’re legitimately running more processes than your rlimits allow, you’ll need to use ulimit in your bash(1) scripts (if running as root) or set higher limits in /etc/security/limits.conf for the nproc property.
If you are instead running into the system-wide process limits, you might need to write a larger value into /proc/sys/kernel/pid_max. See proc(5) for some (short) details.
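On the point above about reaping dead children: on the Java side, Process.waitFor() is the call that waits for the child to exit. A related thing worth ruling out is leaked pipes, since each Process holds stdin, stdout and stderr streams until they are closed or garbage collected. The following is a minimal sketch, not a diagnosis of the original code; the class and method names are hypothetical, and it assumes the child's output can simply be discarded.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical helper for illustration only.
final class ReapingRunner {

    // Starts the command, drains its output, waits for it to exit,
    // and always releases the streams attached to the child.
    static int runAndReap(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();
        try {
            process.getOutputStream().close(); // the child gets no stdin
            InputStream output = process.getInputStream();
            byte[] buffer = new byte[4096];
            while (output.read(buffer) != -1) {
                // Discard output so the child cannot block on a full pipe.
            }
            return process.waitFor(); // blocks until the child has exited
        } finally {
            closeQuietly(process.getInputStream());
            closeQuietly(process.getErrorStream());
            closeQuietly(process.getOutputStream());
            process.destroy(); // harmless if the child has already exited
        }
    }

    private static void closeQuietly(Closeable stream) {
        try {
            stream.close();
        } catch (IOException ignored) {
            // Nothing useful can be done if close fails.
        }
    }
}
```

In the question's run() method, something like `ReapingRunner.runAndReap(Arrays.asList("./whatever.sh", in, out))` would take the place of the bare start()/waitFor() pair. Whether this changes the behaviour described is something only the ps(1) monitoring can confirm.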