I am trying to run my code on NVIDIA’s K10 GPU. I am using

Question

Asked: June 16, 2026

I am trying to run my code on NVIDIA’s K10 GPU. I am using the 5.0 CUDA driver and the 4.2 CUDA runtime. The problem is that the time taken by the kernel increases from one iteration to the next, even though every iteration uses the same number of sources and targets (particles). Because of this, the kernel eventually takes a very long time, and the code crashes with a runtime error that says something like “GPU fallen off the bus”.

The plot showing the behavior of increasing kernel run time with number of iterations can be seen here:

https://docs.google.com/open?id=0B5QLL4ig3LVqODdmVjNBTlp5UFU

I tried running the NVIDIA “nbody” example to see whether the same thing happens there too, and it does. For number of particles/bodies (Np) = 1e5 and 10 iterations, the code runs fine. For Np = 1e5 and 100 iterations, or Np = 1e6 and 10 iterations, the code goes into a mode where it hangs the entire system.

When I run my own kernel as well as NVIDIA’s nbody example on a different machine with an NVIDIA Tesla C2050 card (CUDA driver version 3.2, runtime version 3.2), there is no problem, and the kernel takes the same amount of time for every iteration.

I am trying to understand what’s going on in the machine with the K10 GPU. I have tried different combinations of CUDA driver and runtime versions on this machine, and here is what I get:

For 5.0 CUDA Driver, 4.2 Runtime, it just hangs and sometimes says “GPU fallen off the bus”.

For 4.2 CUDA Driver, 4.2 Runtime, the codes (nbody as well as my code) crash with error: “CUDA Runtime API error 39: uncorrectable ECC error encountered.”

For 5.0 CUDA Driver, 5.0 Runtime, it just hangs and sometimes says “GPU fallen off the bus”.

This is a 64-bit Linux machine, which we recently assembled with an NVIDIA K10 GPU card. I am using gfortran44 and gcc44.

Please let me know if any other information is needed to track down the problem.

Thanks in advance for the help!

M

1 Answer

Editorial Team · Answer 1 · 2026-06-16T03:24:42+00:00

I’m mostly just creating an answer so we can call this question closed, but I’ll try to add a few details.

Tesla GPUs come in 2 distinct categories: those with a fan, and those without. Those with a fan carry (at this time) the “C” designation, although the K20 product family naming will be slightly different:

These are not exhaustive lists:

Tesla GPUs with a Fan: C870, C1060, C2050, C2070, C2075, K20c (“C Class”)
Tesla GPUs without a Fan: M1060, M2050, M2070, M2075, M2090, K10, K20, K20X (“M class”)

(note that there is currently no K10 type product with a fan or “C” designation)

Tesla GPUs with a fan are designed to be plugged into a wide variety of PC boxes and chassis, including various workstation and server variants. Since they have their own fan, they require a supply of inlet air that is below a certain temperature level, but given that, they will keep themselves cool. As the workload increases, and the generated heat increases, they will spin up their own fan to keep themselves cool. The main ways you can screw up this process are by either restricting the inlet air flow or by putting it in an ambient air environment that is hotter than its max inlet spec.

Tesla GPUs without a fan have a passive heatsink. They cannot keep themselves cool independently, and they take a passive role in the cooling process. They still have a temperature sensor, but it becomes the responsibility of the server BMC (baseboard management controller) to monitor this sensor (this is done directly at the hardware/firmware level, independent of any OS or any activity being directed at the GPU), and to direct a level of airflow over the card that is sufficient to keep the card cool based on its indicated temperature. The BMC does this by ramping up whatever fans in the server chassis control airflow over the GPU. Normally there will be shrouding/ducting within the chassis to aid in this process. Server manufacturers integrating these cards have a variety of responsibilities and must follow various technical specifications from NVIDIA in order to make this work.
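To make the closed-loop idea concrete, here is a toy sketch in Python of the kind of mapping a controller might apply: read the temperature, then pick a fan duty cycle. This is only an illustration. The function name `fan_duty_percent` and every number in it (`idle_c`, `max_c`, `min_duty`) are hypothetical and are not taken from any NVIDIA or server vendor specification.

```python
def fan_duty_percent(temp_c, idle_c=40.0, max_c=80.0, min_duty=30.0):
    """Map a GPU temperature to a chassis fan duty cycle (percent).

    All thresholds are made-up illustration values, not vendor specs.
    Below idle_c the fans run at min_duty; at or above max_c they run
    flat out; in between the duty cycle rises linearly.
    """
    if temp_c <= idle_c:
        return min_duty
    if temp_c >= max_c:
        return 100.0
    fraction = (temp_c - idle_c) / (max_c - idle_c)
    return min_duty + fraction * (100.0 - min_duty)
```

A real BMC reads the sensor at the hardware/firmware level and follows the integration specs, so treat this purely as a picture of "hotter card, more airflow", not as something to deploy.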

If you happen to get your hands on a Tesla GPU without a fan and just slap it in some random chassis, you’re pretty much guaranteed to have the behavior as described in this question. For this reason, Tesla “M” series and “K” series GPUs are normally only sold to OEMs who have undergone the qualification process.

Since the average sysadmin/system assembler is not likely to devise a suitable closed loop fan control system, and normally does not have easy access to the specifications defining the temperature sensor and its access method, the only klugey workaround, if you have one of these that you simply must play with, is to direct a high level of continuous airflow over the card in whatever setting you put it. Be advised that this will most likely be noisy. If you don’t have a noisy level of airflow, you probably do not have enough airflow to keep a card cool under a high workload.

In addition, you should probably keep an eye on GPU temps. Note that the nvidia-smi method for monitoring GPU temps does not work for all M class GPUs (i.e. GPUs without a fan). Unfortunately, the method of temperature sensor access in Fermi and prior for the M class GPUs (different from the C class GPUs) was such that it could not be readily monitored in-system via the nvidia-smi command, so in these cases you will get no temperature reading from nvidia-smi, making this approach even harder to manage. Things changed with the Kepler generation, so now the temperature can be monitored both by the nvidia-smi method and by the server BMC at the hardware/firmware level.
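If nvidia-smi does report a temperature on your card, a small polling script makes it easier to watch while the workload runs. The Python sketch below assumes `nvidia-smi` is on the PATH and that the installed driver supports the `--query-gpu` option (older drivers may not). The helper names and the `warn_at_c` value of 85 are hypothetical choices for illustration, not NVIDIA limits. Any non-numeric reading is treated as "no reading", which is what you would expect on the boards described above where the sensor cannot be read in-system.

```python
import subprocess
import time

# Ask nvidia-smi for one "index, temperature" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn 'index, temp' lines into {gpu_index: temp_c or None}."""
    readings = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip anything that is not an 'index, value' row
        # A non-numeric value means the tool gave no usable reading.
        readings[int(fields[0])] = int(fields[1]) if fields[1].isdigit() else None
    return readings


def poll(interval_s=5.0, warn_at_c=85):
    """Print GPU temperatures forever; warn_at_c is a made-up threshold."""
    while True:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        for index, temp in sorted(parse_temperatures(result.stdout).items()):
            if temp is None:
                print(f"GPU {index}: no temperature reading")
            else:
                flag = "  <-- hot" if temp >= warn_at_c else ""
                print(f"GPU {index}: {temp} C{flag}")
        time.sleep(interval_s)
```

Run `poll()` in a second terminal while the kernel is iterating. A temperature that keeps climbing alongside the kernel run time points at cooling rather than at the code.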

C class products with a fan have a temperature that can be monitored with nvidia-smi, regardless of generation. But this is normally not necessary, since the card has its own control system to keep itself cool.

As mentioned in the comments, all GPUs also have a variety of protection mechanisms, none of which are guaranteed to prevent damage. (If you throw the card in a fire, there’s nothing to be done about that.) But the first typical mechanism is thermal throttling. At some predefined high temperature near the maximum safe operating range of the GPU, the GPU firmware will independently reduce its clocks to attempt to prevent further temperature rise. (If the card is clocked slower, then generally its ability to generate heat is also somewhat reduced.) This is a crude mechanism, and when this thermal throttling occurs, something in the cooling arena is already wrong. The card is designed to never enter thermal throttling under normal operating conditions. If temperatures continue to rise (and there is not much headroom at this point), the card will enter its final protection mode, which is to halt itself. At this point the GPU has become unresponsive to the system, and at the OS level, messages like “GPU has fallen off the bus” are typical. This means cooling has failed and protection mechanisms have failed.
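Reduced clocks show up as exactly the symptom in the question: the same kernel, on the same amount of work, takes longer on each iteration. A rough way to flag that pattern from a list of per-iteration timings is a least-squares slope, as in the Python sketch below. The function names and the 1% per iteration `threshold` are hypothetical, and a rising trend is only a hint that throttling may be happening, not proof.

```python
def relative_slope(times_ms):
    """Least-squares slope of time vs. iteration, as a fraction of the mean.

    A value of 0.05 means run time grows by about 5% of its mean
    per iteration. Timings are assumed to be positive.
    """
    n = len(times_ms)
    if n < 2:
        raise ValueError("need at least two timings")
    mean_x = (n - 1) / 2.0
    mean_y = sum(times_ms) / n
    sxy = sum((i - mean_x) * (t - mean_y) for i, t in enumerate(times_ms))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return (sxy / sxx) / mean_y


def looks_like_slowdown(times_ms, threshold=0.01):
    """True if run time trends upward faster than the (made-up) threshold."""
    return relative_slope(times_ms) > threshold
```

For example, `looks_like_slowdown([10.0, 11.0, 12.0, 13.0, 14.0])` flags a steady rise, while a flat series such as the one seen on the C2050 machine does not.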
