SLURM Support for Remote GPU Virtualization: Implementation and Performance Study
Abstract
SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs executing in a cluster. However, SLURM was not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can manage GPUs through its generic resource plugin (GRes), with this solution the hardware accelerators can only be accessed by jobs executing on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing user-transparent access to all the GPUs in the cluster, independently of which node the application runs on and which node hosts the GPU. In this work we introduce a new type of device in SLURM, "rgpu", which allows an application running on any node to access any GPU node in the cluster, using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a job can use any number of GPUs, as SLURM schedules tasks taking into account all the graphics accelerators available in the entire cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
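To illustrate how such a request could look from the user's side, the sketch below shows a hypothetical SLURM batch script asking for remote GPUs through the proposed "rgpu" generic resource. The "rgpu" name comes from this work, but the exact request syntax and the environment handling in the comments follow standard SLURM GRes conventions and are assumptions for illustration only, not the definitive interface described later in the paper.

```bash
#!/bin/bash
# Hypothetical job script: requests two "rgpu" devices, i.e. GPUs that may be
# attached to any node in the cluster rather than to the node running the job.
# The --gres=rgpu:<count> form mirrors the standard SLURM GRes request syntax
# and is assumed here for illustration.
#SBATCH --job-name=cuda-app
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=rgpu:2

# With the proposed scheduling support, SLURM would select two GPUs anywhere
# in the cluster and configure the rCUDA client environment of the job so that
# CUDA calls are transparently forwarded to the nodes hosting those GPUs.
srun ./cuda_application
```

In contrast, with the stock GRes plugin the equivalent request (`--gres=gpu:2`) can only be satisfied by GPUs physically installed in the node allocated to the job, which is precisely the limitation this work removes.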