We want to extend our batch system to support GPU computations.
The problem is that, from the batch system's viewpoint, the GPU is a resource. We can easily count used resources, but we also need to limit access to them.
For GPUs that means that each job claims a GPU exclusively (when a GPU is requested).
From what I have been told, sharing GPUs between jobs is a very bad idea (because the GPU part of jobs might be killed randomly).
So what I need is some way to limit access to GPUs for CUDA and OpenCL. The batch system has root privileges. I can limit access to devices in /dev/ using cgroups, but I gather that this won't be enough in this case.
Ideally, a job would see only as many GPUs as it requested, and no other job would be able to access them.
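To make the cgroup idea above concrete, here is a minimal sketch of how the deny rules might be generated. It assumes the cgroup v1 devices controller, whose rule format is "type major:minor access". The helper names (format_rule, rule_for_device) and the example path /dev/nvidia0 are purely illustrative. The sketch only builds the rule strings; writing them into a job's devices.deny file is left to the batch system.

```python
import os
import stat


def format_rule(major, minor, access="rwm"):
    """Build a cgroup v1 devices rule for a character device."""
    return f"c {major}:{minor} {access}"


def rule_for_device(path, access="rwm"):
    """Derive the rule from an existing device node, e.g. /dev/nvidia0.

    The major/minor numbers are read from the node itself rather than
    hard-coded, so no assumption is made about the driver's numbering.
    """
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{path} is not a character device")
    return format_rule(os.major(st.st_rdev), os.minor(st.st_rdev), access)


# Hypothetical usage by a batch system running as root:
#   rule = rule_for_device("/dev/nvidia0")
#   then write `rule` to <job cgroup>/devices.deny for every job
#   that was not allocated that GPU.
```

Whether blocking the device nodes alone is sufficient for CUDA and OpenCL is exactly the open question here, so treat this as a starting point only.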
There are two relevant mechanisms at the moment:
The latter is of course subject to abuse but it’s a start for now.
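As an illustration of a mechanism that is easy to abuse, here is a minimal sketch that assumes the batch system restricts a job through the CUDA_VISIBLE_DEVICES environment variable. This may or may not be the mechanism meant above. The helper names (build_job_env, run_job) are hypothetical. The CUDA runtime only enumerates the devices listed in that variable, but a job is free to overwrite it, so this is a convention rather than an enforcement.

```python
import os
import subprocess
import sys


def build_job_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # Comma-separated device indices, in the order the job should see them.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_job(cmd, gpu_ids):
    """Launch the job command with the restricted environment."""
    return subprocess.run(cmd, env=build_job_env(gpu_ids), check=False)


if __name__ == "__main__":
    # Stand-in for a real job: print what the job would see.
    show = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    run_job([sys.executable, "-c", show], [0, 2])
```

Inside the job, the visible devices are renumbered from zero, so the job does not need to know which physical GPUs it was given.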
Not really. The main reason that sharing a GPU is a bad idea is that the jobs have to compete for the available memory, and all of the processes may fail even though one of them could have proceeded on its own. In addition, they compete for access to the DMA and compute engines, which can result in poor overall performance.