.jpeg)
Multi-Instance GPUs on Runpod: Stop Paying for Compute You Don't Need
With MIG, we can partition RTX 6000 Pro cards into isolated 24 GB instances. Here's when it makes sense for your workloads.
Blog
Slurm on Runpod Clusters makes it simple to scale distributed AI and scientific computing across multiple GPU nodes. With pre-configured setup, advanced job scheduling, and built-in monitoring, users can efficiently manage training, batch processing, and HPC workloads while testing connectivity, CUDA availability, and multi-node PyTorch performance.

Slurm (Simple Linux Utility for Resource Management) is a powerful job scheduler and resource manager designed for high-performance computing (HPC) environments. When combined with Runpod's Clusters, it provides a robust platform for managing distributed AI workloads, scientific computing, and batch processing tasks across multiple GPU nodes.
Runpod Clusters can come with Slurm pre-configured, featuring automatic cluster setup with one node designated as the "Slurm Controller" and additional nodes as "Slurm Agents." This guide will walk you through deploying, configuring, and testing a Slurm cluster on Runpod.
While you could theoretically manage a cluster using direct TCP/IP connections and manual job distribution, Slurm offers several critical advantages that make it indispensable for serious distributed computing:
Resource Management and Fair Sharing: Slurm intelligently allocates GPUs, CPUs, and memory across jobs, preventing resource conflicts and ensuring optimal utilization. Unlike manual clustering approaches, Slurm can handle complex scenarios where multiple users submit jobs simultaneously, automatically queuing work and allocating resources based on priority, user limits, and availability.
Job Scheduling and Queue Management: Rather than manually coordinating when and where jobs run, Slurm provides sophisticated scheduling algorithms that can optimize for throughput, fairness, or specific priority schemes. It handles dependencies between jobs, manages job arrays for parameter sweeps, and can automatically retry failed jobs - capabilities that would require significant custom development in a basic TCP/IP cluster setup.
Fault Tolerance and Monitoring: Slurm provides built-in monitoring, logging, and fault tolerance that goes far beyond simple network connectivity checks. It tracks job completion, resource usage, and can automatically handle node failures by rescheduling work to healthy nodes. The system maintains detailed accounting records and can generate usage reports - essential features for production environments that are absent in basic clustering approaches.
Integration with Distributed Frameworks: Modern AI frameworks like PyTorch, TensorFlow, and scientific computing tools are designed to work seamlessly with Slurm's environment variables and resource allocation model. This integration handles the complex coordination required for distributed training - managing process ranks, setting up communication backends like NCCL, and coordinating initialization across nodes - tasks that would require extensive custom coordination logic in a manual TCP/IP setup.
Runpod Clusters come with an option to have Slurm pre-configured, featuring automatic cluster setup with one node designated as the "Slurm Controller" and additional nodes as "Slurm Agents."
To set up a cluster with Slurm enabled, all you need to do is select Slurm Cluster in the cluster type dropdown when deploying.

Once deployed, you'll see your cluster nodes listed. The controller node will be labeled "Slurm Controller." You can connect to the Slurm Controller node via Jupyter Notebook and open up a Terminal for ease of access, though any method of connecting via a terminal will work just fine.

Run this comprehensive check immediately after connecting:
You should see the following:
Notably, you’ll want to make sure you have the IP addresses of all your nodes in your cluster, as we’ll need this later. In this case, they are:
You can also use sinfo to verify that all nodes are up and running:
Create a directory for your Slurm jobs:
The easiest way to run and submit jobs on Slurm is to save them as .sh files, and then to submit them with the sbatch command.
For example, here is a script to test communication between nodes. You can just run this in the terminal and it will write everything directly to the .sh or .py files as needed.
Here is a script to test distributed Pytorch training across nodes:
You can run these, or any other scripts you create using the following commands:
The results of any given job run will be written to .out files (or .err files if there is an error). Here’s what you should see in the output when testing the distributed training test:
This guide walked you through a systematic approach to deploying and validating Slurm on Runpod Clusters. We started with basic cluster deployment and connectivity verification, then progressively tested more complex distributed computing scenarios. The connectivity test confirmed that nodes could communicate and execute jobs across the cluster, while the GPU detection test verified that CUDA resources were properly allocated and visible to Slurm jobs.
The single-node PyTorch test established that the AI framework stack was functioning correctly on individual nodes, providing a foundation for distributed testing. The multi-node PyTorch distributed test represented the culmination of our validation process, demonstrating that NCCL could successfully coordinate tensor operations between nodes using the high-speed ens1 network interface. The successful all_reduce operation - where tensors from different ranks were properly aggregated across nodes - confirmed that the cluster is ready for real distributed training workloads.
Blog Posts
.jpeg)
With MIG, we can partition RTX 6000 Pro cards into isolated 24 GB instances. Here's when it makes sense for your workloads.

How 1,100 researchers beat OpenAI's own baseline with 16 megabytes and 10 minutes.

Learn how to set up a real-world agentic system with our new Flash framework.