Sometimes, builds run by Jenkins (1.461) stop at a random point partway through. These builds are manually scripted calls to Visual Studio 2008 SP1’s devenv.com, primarily for C++ code. Visual Studio emits no error message; the last line in devenv’s log is just whichever file happened to be building at the time. The Jenkins build then fails because a post-build Windows batch command relies on some of the build outputs. This happens fairly rarely (roughly 1 in 15 builds). Jenkins’s error log shows nothing out of the ordinary around the time the build fails. Surprisingly, that log says the build succeeded, even though Jenkins shows it as failed everywhere else.
The problem is isolated to Jenkins. The same build script run at a developer’s desk has never failed in this way.
The Jenkins nodes are 32-bit Windows XP machines. They all have ample available disk space. Jenkins is configured to only run one job at a time per node. The event logs show no obviously bad things (e.g., Visual Studio crashes) happening at the times when the builds stop.
Does anyone have any ideas of things to look into to troubleshoot this?
We ended up correlating the random build failures with logoff events on the Jenkins nodes. This led us to a JVM bug/feature (Oracle Java bug ID 6871190), where a logoff event in Windows causes the signal handlers to terminate the JVM. You can disable this behavior (perhaps with other downsides) by passing the -Xrs option to the JVM, but that option does not automatically propagate to child Java processes.
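For anyone who wants to run the same kind of check, here is a rough sketch of the correlation step in Python. It is not the script we used. The timestamps are made up, the two-minute window is an arbitrary assumption, and `failures_near_logoffs` is a hypothetical helper; you would fill the two lists from your own Jenkins build history and Windows event log export.

```python
from datetime import datetime, timedelta


def failures_near_logoffs(failures, logoffs, window=timedelta(minutes=2)):
    """Pair each build failure with the first logoff within `window` of it.

    Failures with no nearby logoff are left out of the result.
    """
    pairs = []
    for failure in failures:
        for logoff in logoffs:
            if abs(failure - logoff) <= window:
                pairs.append((failure, logoff))
                break  # one matching logoff per failure is enough
    return pairs


# Hypothetical data: times at which builds died, and logoff times on the node.
failures = [
    datetime(2012, 5, 3, 14, 7, 30),
    datetime(2012, 5, 4, 9, 15, 0),
]
logoffs = [
    datetime(2012, 5, 3, 14, 7, 5),
    datetime(2012, 5, 4, 18, 0, 0),
]

matches = failures_near_logoffs(failures, logoffs)
for failure, logoff in matches:
    print(f"build died at {failure}, logoff at {logoff}")
```

If most failures pair up with a logoff and successful builds do not, that points strongly at the logoff as the trigger.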
We pass -Xrs when starting Jenkins itself, and the Jenkins service survives a logoff. The current hypothesis is that some part of Jenkins’s build process is started through a child Java process that is not invoked with -Xrs.
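Since -Xrs is not inherited, any wrapper that launches a child JVM has to add the option explicitly. Below is a minimal Python sketch of that idea, assuming the child command is available as a list of arguments with the java executable first. `with_xrs` and the jar name are hypothetical, and this is not how Jenkins itself builds its command lines.

```python
def with_xrs(java_cmd):
    """Return a copy of java_cmd with -Xrs right after the executable.

    JVM options must come before -jar or the main class, so the option is
    inserted at position 1. This is a simple check: if -Xrs appears anywhere
    in the command already, the command is returned unchanged.
    """
    if not java_cmd:
        raise ValueError("empty command")
    if "-Xrs" in java_cmd:
        return list(java_cmd)
    return [java_cmd[0], "-Xrs"] + list(java_cmd[1:])


# Hypothetical child command; the jar name is only for illustration.
child_cmd = with_xrs(["java", "-jar", "some-agent.jar"])
print(child_cmd)
```

The resulting list can be passed to `subprocess.Popen` or an equivalent launcher. Checking the full command line of each running java process on the node will show which child JVMs are still missing the option.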