
Slurm on OpenNebula: HPC batch scheduling for AI training

OpenNebula has published a walkthrough on running Slurm, the batch scheduler most HPC sites already rely on, on top of its platform to train AI models on GPU infrastructure. The point is to hand researchers and data scientists the job-queue interface they know, without asking them to change their workflow, while OpenNebula handles the hardware underneath.

There are two paths. By default, the Slurm appliances run inside virtual machines managed by OpenNebula. That keeps VM isolation and lifecycle intact, and GPU access goes through PCI passthrough for near-metal accelerator performance. For teams that want more, OpenNebula is expanding its integration with NVIDIA Infra Controller (NiCo) to bring Slurm onto bare-metal nodes.

What ships

The concrete news is a pair of Marketplace appliances: a Controller and a Worker. Both automate two steps that usually require manual setup: Munge authentication between cluster nodes and integration with OpenNebula OneGate. Standing up a Slurm cluster stops being a hand-configuration job.

For high-performance networking, the post covers NVIDIA Quantum InfiniBand exposed to the VMs through SR-IOV or, again, PCI passthrough. Elastic scaling lets you spin up additional workers from a template when demand spikes, so a GPU worker pool grows with the pending job load.
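To make the elastic-scaling idea concrete, here is a minimal sketch of the sizing decision only: given the pending job count, how many extra workers to request. It is not taken from the OpenNebula post. The function names, the one-job-per-worker default, and the cap on new workers are all assumptions for illustration; the only real interface referenced is the output of `squeue -h -t PENDING -o %i`, which prints one pending job ID per line.

```python
import math


def count_pending(squeue_output: str) -> int:
    """Count pending jobs from the text printed by
    `squeue -h -t PENDING -o %i` (one job ID per line).
    Simplification: a pending job array may appear as a single line."""
    return sum(1 for line in squeue_output.splitlines() if line.strip())


def workers_to_add(pending_jobs: int, idle_workers: int,
                   jobs_per_worker: int = 1, max_new: int = 4) -> int:
    """Hypothetical sizing rule: cover the pending load with idle workers
    first, then request the shortfall, capped at max_new per round."""
    if jobs_per_worker < 1:
        raise ValueError("jobs_per_worker must be at least 1")
    needed = math.ceil(pending_jobs / jobs_per_worker)
    shortfall = max(0, needed - idle_workers)
    return min(shortfall, max_new)


if __name__ == "__main__":
    sample = "101\n102\n103\n104\n105\n"  # made-up squeue output
    pending = count_pending(sample)
    print(workers_to_add(pending, idle_workers=1, jobs_per_worker=2))
```

The returned number would then feed whatever mechanism instantiates the worker template; that step is deliberately left out here because the post does not describe it in detail.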

As a practical demo, the post fine-tunes a language model with Unsloth on a small model. That is enough to show the full loop: submit the job to the Slurm queue, let the scheduler place it on a GPU worker, and collect the result.
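The submit step of that loop can be sketched as follows. This is an illustration under stated assumptions, not the script from the OpenNebula post: the job name, the `python train.py` command, and the helper names are hypothetical. It relies only on standard Slurm behavior: `sbatch` reads the script from standard input when no file is given, `--parsable` prints the job ID (optionally followed by `;cluster`), and `--gres=gpu:N` requests GPUs.

```python
import subprocess


def render_batch_script(job_name: str, gpus: int, command: str,
                        time_limit: str = "01:00:00") -> str:
    """Build a batch script that asks Slurm for GPUs and runs one command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=%x-%j.out",  # %x = job name, %j = job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


def parse_job_id(sbatch_stdout: str) -> int:
    """`sbatch --parsable` prints 'jobid' or 'jobid;cluster'."""
    return int(sbatch_stdout.strip().split(";")[0])


def submit(script: str) -> int:
    """Pipe the script to sbatch on stdin and return the new job ID.
    Needs a working Slurm client on the machine where it runs."""
    result = subprocess.run(["sbatch", "--parsable"], input=script,
                            text=True, capture_output=True, check=True)
    return parse_job_id(result.stdout)


if __name__ == "__main__":
    # train.py is a placeholder for the actual fine-tuning entry point.
    print(render_batch_script("finetune", 1, "python train.py"))
```

Once the job is queued, the scheduler places it on a worker with a free GPU, and the output lands in a file named after the job name and job ID.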

OneSlurm, still in preview

The article also previews OneSlurm, a managed component that is not generally available yet. Its goal is to simplify deployment, operation, and lifecycle management of Slurm clusters on OpenNebula-managed infrastructure. The reference docs point to version 7.0, and the work sits within OpenNebula 7.2.

Who it’s for

This fits HPC research centers, large AI research labs, and AI Factory environments that already have GPUs and want to give their teams a Slurm job queue without building a dedicated cluster from scratch. If you come from Linux virtualization, the foundation is the usual one: KVM with QEMU and libvirt below. For a refresher on how those layers fit, see KVM, QEMU and libvirt on RHEL. And when the GPU goes through PCI passthrough, it helps to read the KVM hypervisor entry.

Source

Original article from OpenNebula: Slurm on OpenNebula: HPC Batch Scheduling for AI Training (June 23, 2026). The virtualization engine underneath is KVM, which acts as the hypervisor for the VMs where the Slurm appliances run.