We want to extend our batch system to support GPU computations. The problem is

Question

Asked: May 17, 2026 (2026-05-17T23:36:42+00:00)

We want to extend our batch system to support GPU computations.

The problem is that, from the batch system's point of view, a GPU is a resource. We can easily count the resources in use, but we also need to restrict access to them.

For GPUs, that means each job that requests a GPU claims it exclusively.

From what I have been told, sharing GPUs between jobs is a very bad idea (because the GPU part of jobs might be killed randomly).

So what I need is some way to limit access to GPUs for CUDA and OpenCL. The batch system has root privileges. I can limit access to devices in /dev/ using cgroups, but I suspect that this won't be enough in this case.

Ideally, a job would see only as many GPUs as it requested, and no other job would be able to access them.

1 Answer

Editorial Team · Answer 1 · 2026-05-17T23:36:43+00:00

There are two relevant mechanisms at the moment:

Use nvidia-smi to set the devices to exclusive mode; that way, once a process has a GPU, no other process can attach to the same GPU.
Use the CUDA_VISIBLE_DEVICES environment variable to limit which GPUs a process sees when it looks for a GPU.

The latter is of course subject to abuse, since a job can simply override the variable, but it's a start for now.
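As a rough illustration of how a batch system might combine the two mechanisms, here is a minimal Python sketch. The function names (exclusive_mode_command, job_environment, launch_job) are hypothetical and not part of any batch system. It assumes nvidia-smi is on the PATH and that the launcher runs with enough privileges to change the compute mode. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID is an extra assumption, added so that CUDA numbers the devices the same way nvidia-smi does.

```python
import os
import subprocess


def exclusive_mode_command(gpu_index):
    """Build the nvidia-smi call that puts one GPU into exclusive-process mode."""
    return ["nvidia-smi", "-i", str(gpu_index), "-c", "EXCLUSIVE_PROCESS"]


def job_environment(gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: enumerate devices in PCI bus order so the indices match nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


def launch_job(argv, gpu_indices):
    """Hypothetical launcher: set exclusive mode, then start the job with a restricted view."""
    for index in gpu_indices:
        subprocess.run(exclusive_mode_command(index), check=True)
    return subprocess.Popen(argv, env=job_environment(gpu_indices))
```

Something like launch_job(["./my_cuda_job"], [1, 3]) would start the job so that it sees only GPUs 1 and 3, which CUDA then renumbers as devices 0 and 1 inside the process. The environment variable alone is advisory; the exclusive mode is what stops a second process from attaching to a GPU that is already in use.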

"From what I have been told, sharing GPUs between jobs is a very bad idea (because the GPU part of jobs might be killed randomly)."

Not really. The main reason that sharing a GPU is a bad idea is that the jobs have to compete for the available memory, so all of the processes may fail even though one of them could have proceeded on its own. In addition, they compete for access to the DMA and compute engines, which can result in poor overall performance.
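On the batch-system side, avoiding this contention comes down to never handing the same GPU to two jobs. The following is a minimal bookkeeping sketch of that idea, not a description of any existing scheduler; the GpuAllocator class and its methods are hypothetical names, and it assumes GPUs are identified by integer indices.

```python
class GpuAllocator:
    """Tracks which job owns which GPU; a GPU is never assigned to two jobs."""

    def __init__(self, gpu_count):
        self._free = set(range(gpu_count))
        self._owned = {}  # job id -> list of GPU indices held by that job

    def allocate(self, job_id, requested):
        """Return the GPU indices reserved for the job, or None if too few are free."""
        if job_id in self._owned:
            raise ValueError("job already holds GPUs: %r" % (job_id,))
        if requested > len(self._free):
            return None  # the job has to wait; GPUs are not shared
        chosen = sorted(self._free)[:requested]
        self._free.difference_update(chosen)
        self._owned[job_id] = chosen
        return chosen

    def release(self, job_id):
        """Give the job's GPUs back to the free pool when it finishes."""
        self._free.update(self._owned.pop(job_id, []))
```

The list returned by allocate is what the launcher would pass on as the job's visible devices, and release would be called once the job has exited.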
