We are running smoke tests on our EC2 systems, and occasionally the VM becomes completely unresponsive while one core spins at 100%; all other cores sit at 0% at that point. It accepts no further connections (RMI, JMX, and HTTP requests to Jetty all fail).
Info:
- High-CPU Extra Large Instance (instance type c1.xlarge)
- Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
- Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
Has anyone experienced something like this before? Any information would be much appreciated, thanks!
We found the issue. We noticed that some instances in our cluster always produced the problem and some never did. Apparently the issue is unique to instances that run more recent CPU versions combined with slightly outdated kernels.
The issue is explained in full here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/727459
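Since the problem depended on the CPU and kernel combination, it helps to record both on every instance in the cluster and compare the ones that hang against the ones that never do. Below is a minimal sketch of how that could be done in Python, assuming a Linux guest where `/proc/cpuinfo` is readable. The function names `cpu_models` and `describe_host` are made up for this example and are not part of any tool we used.

```python
import platform
from pathlib import Path


def cpu_models(cpuinfo_text):
    """Return the distinct 'model name' values found in /proc/cpuinfo text."""
    models = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            model = value.strip()
            if model not in models:
                models.append(model)
    return models


def describe_host(cpuinfo_path="/proc/cpuinfo"):
    """Collect the kernel release and CPU model(s) of the current machine."""
    path = Path(cpuinfo_path)
    # Fall back to an empty string on systems without /proc/cpuinfo.
    text = path.read_text() if path.exists() else ""
    return {"kernel": platform.release(), "cpu_models": cpu_models(text)}


if __name__ == "__main__":
    info = describe_host()
    print("kernel:", info["kernel"])
    for model in info["cpu_models"]:
        print("cpu:", model)
```

Running this on each instance and grouping the output by whether the instance ever locked up should make a CPU/kernel pattern like the one we saw visible fairly quickly.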
As the running time increases, the spikes shown in the graph below become longer and longer, possibly due to time shift, and at some point one core spins for a long time.
CPU usage graphs. Affected instance versus unaffected:
The bug was fixed in the kernel recently, so a kernel update resolved the problem for us.
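For anyone who wants to watch for the same symptom (one core pegged while the others sit idle) before an instance locks up completely, here is a rough sketch that samples per-core usage from `/proc/stat` on Linux. It is not how our graphs were produced. The helper names and the 95%/5% thresholds are assumptions chosen for illustration.

```python
import time


def read_core_times(stat_text):
    """Parse /proc/stat text into {core: (idle_ticks, total_ticks)}."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; the aggregate line is just "cpu".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        ticks = [int(x) for x in fields[1:]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        # Guest time is already counted in user/nice, so sum only the first 8 fields.
        times[fields[0]] = (idle, sum(ticks[:8]))
    return times


def busy_percent(before, after):
    """Return {core: percent busy} between two read_core_times() snapshots."""
    usage = {}
    for core, (idle1, total1) in after.items():
        if core not in before:
            continue
        idle0, total0 = before[core]
        delta_total = total1 - total0
        if delta_total <= 0:
            continue
        usage[core] = 100.0 * (1 - (idle1 - idle0) / delta_total)
    return usage


def single_core_pegged(usage, high=95.0, low=5.0):
    """True when exactly one core is >= high and every other core is <= low."""
    hot = [c for c, u in usage.items() if u >= high]
    cold = [c for c, u in usage.items() if u <= low]
    return len(usage) > 1 and len(hot) == 1 and len(cold) == len(usage) - 1


if __name__ == "__main__":
    with open("/proc/stat") as f:
        first = read_core_times(f.read())
    time.sleep(1)
    with open("/proc/stat") as f:
        second = read_core_times(f.read())
    usage = busy_percent(first, second)
    print(usage)
    print("single core pegged:", single_core_pegged(usage))
```

Logging this periodically alongside the uptime should show whether the spikes grow longer over time, as they did on our affected instances.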