Emmett Fear

How do I train Stable Diffusion on multiple GPUs in the cloud?

Stable Diffusion is a powerful (and computationally heavy) image generation model. If you want to train or fine-tune Stable Diffusion – whether it’s training from scratch, fine-tuning on a custom dataset, or doing something like DreamBooth with your own images – you might find that a single GPU isn’t enough or is too slow. That’s where multiple GPUs come in. Training diffusion models on multiple GPUs in the cloud can drastically speed up the process, but it also introduces some complexity. Here, we’ll explain strategies to scale Stable Diffusion training across GPUs, and how to do it on a cloud platform like Runpod.

Why Scale Stable Diffusion Training to Multiple GPUs?

Stable Diffusion, especially newer versions (like Stable Diffusion 2.0 or SDXL), is a large model. Training it involves huge amounts of matrix multiplications and a lot of image data. There are a few reasons to use multiple GPUs:

  • Speed: The simplest reason – 2 GPUs can (ideally) process ~2× the data of 1 GPU in the same time. If a single GPU would take 10 hours to go through your dataset for one epoch, 2 GPUs might cut that to ~5 hours (there is some overhead, so scaling isn’t perfectly linear, but it’s significant). For long training or fine-tuning runs, this time savings is gold.
  • Memory: Each GPU has limited VRAM. Stable Diffusion’s model (the U-Net, text encoder, etc.) plus large image batches can barely fit on, say, a 16GB GPU if you want a decent batch size. Multiple GPUs give you more total VRAM to work with. With data parallelism (more on that below), each GPU holds a full copy of the model and handles a portion of the batch – so you don’t pool memory for a single model instance, but you can increase the overall batch size without running out of memory on any one GPU. If doing model parallelism (less common for SD), you could split parts of the model across GPUs.
  • Large Models / SDXL: The latest Stable Diffusion XL (SDXL) model is bigger than the original SD1.x models. It’s more demanding to train. People have found that multi-GPU setups (or at least an A100 80GB) are needed to train SDXL efficiently. If you’re fine-tuning SDXL, you might leverage multiple GPUs to handle the larger model and larger image sizes.
  • Distributed Experiments: You might want to try different hyperparameters or augmentations in parallel. That’s not quite the same as training one model on multiple GPUs, and cloud platforms let you spin up multiple GPU instances for it. But focusing on a single training run: multi-GPU simply accelerates iteration, which means more experiments done in less time.

A note: If you’re just doing a small DreamBooth (a couple hundred images to inject a concept), a single GPU (like a 16GB or 24GB) might be enough. Multi-GPU shines when you either have lots of data or are doing many training steps.

Approaches to Multi-GPU Training

To use multiple GPUs effectively, you typically employ one of the parallelism strategies:

  • Data Parallelism: This is the most common. Each GPU gets a copy of the model. Your training batch is split among the GPUs. For example, if your batch size is 32 and you have 2 GPUs, each GPU might process 16 images and compute gradients on those. Then the gradients are averaged (or summed) across GPUs to update the single model (which is kept in sync across GPUs). Frameworks like PyTorch provide Distributed Data Parallel (DDP) modules to handle the communication (all-reduce of gradients). PyTorch Lightning and Hugging Face Accelerate also can set this up for you. Data parallelism is straightforward and works well if your model fits on one GPU. Stable Diffusion does fit on one GPU (with FP16 it’s around 7-10GB for the model weights plus activations depending on batch), so this is a good method. Bottom line: you get nearly linear speed-up until communication overhead or other factors kick in.
  • Model Parallelism: This is where you split the model across GPUs. For instance, half the layers on GPU0, half on GPU1. This is only needed if the model is too big for one GPU’s memory. Stable Diffusion typically doesn’t require this (unless maybe you try to train at insane resolutions or something). Model parallelism is more complex to implement and often slower due to heavy communication (activations have to pass between GPUs). Not usually used for SD, but frameworks like DeepSpeed or Megatron can do it for huge models (mostly used in language models).
  • Pipeline Parallelism: A specific kind of model parallel where different GPUs handle different stages of the forward/backward pass in a pipeline fashion. Also usually for extremely large models. Probably overkill for Stable Diffusion.
  • Hybrid Parallelism: Combining the above (like model + data parallel for extremely large-scale). Not necessary for SD in most cases.

In practice, data parallelism is king for Stable Diffusion multi-GPU training. You simply increase your effective batch size by using more GPUs, which shortens epoch time. The training code (such as the scripts from Hugging Face’s diffusers library or the popular kohya_ss fine-tuning scripts) often has built-in support for DDP. For example, you might launch the script with torchrun --nproc_per_node=2 train.py ... to use 2 GPUs, and the script will then do data-parallel training. If using PyTorch Lightning, you’d specify devices=2, strategy="ddp" and it’s handled for you.
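For readers who prefer to wire this up by hand, here is a minimal sketch of what a data-parallel training script looks like under raw PyTorch DDP, assuming it is launched with torchrun --nproc_per_node=2 train_ddp.py. The names build_model, get_dataloader, and compute_loss are placeholders for your own fine-tuning code, not functions from diffusers or kohya_ss.

```python
# Minimal DDP sketch (not a complete fine-tuning script).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # NCCL handles GPU-to-GPU gradient sync
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun for each process
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)         # placeholder: e.g. the SD U-Net you're fine-tuning
    model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across GPUs automatically

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for batch in get_dataloader(rank=dist.get_rank()):  # placeholder: each rank loads its own shard
        loss = compute_loss(model, batch)          # placeholder: diffusion noise-prediction loss
        loss.backward()                            # DDP synchronizes gradients during backward
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```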

One external insight: multi-GPU training of Stable Diffusion has been found to yield real performance benefits, but not 100% scaling efficiency. For instance, with two GPUs each iteration may be a bit slower due to overhead, but since twice as many images are processed at once, you still come out ahead overall – one test showed ~36% faster training with 2 GPUs after accounting for the overhead. The key is to handle the “epochs multiply” issue: by default, if you don’t adjust anything, two GPUs can end up doing double the work because each sees the whole dataset. The solution is usually to set the number of training steps or epochs accordingly (or use the script’s built-in distributed support, which usually handles it). The kohya_ss training scripts note that you should adjust for the GPU count to avoid over-counting steps.

In summary, the approach is: use data parallel, and ensure your training loop is aware of the number of GPUs so it doesn’t accidentally double count epochs.

Setting Up Multi-GPU Training on Runpod

Now, how to do this on Runpod’s cloud:

Option 1: Multi-GPU instance – Runpod allows you to launch a single pod (instance) with multiple GPUs (depending on availability and type – e.g., 2× or 4× RTX 3090, or multi-A100 setups). This is the easiest for distributed training because the GPUs are on the same machine with high-speed interconnect (often NVLink or at least PCIe). You don’t have to deal with networking between servers. You simply launch your training as described (with torchrun or using Lightning, etc., which will see multiple GPUs locally).

Steps example: In the Runpod console, when configuring a pod, choose an instance type that has 2 or more GPUs. Once the pod is up, your environment (say, an Ubuntu image with PyTorch installed) will see both GPUs (CUDA_VISIBLE_DEVICES=0,1). Then run your training script with the appropriate command for multi-GPU. If using the Accelerate library from Hugging Face, you might run accelerate launch --multi_gpu ... train.py. If using Lightning, set the trainer flags or environment variables. If using raw PyTorch DDP, use torchrun or python -m torch.distributed.run.
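Under the hood, scripts that use Accelerate follow a pattern like the sketch below. This is not the diffusers script itself, just the general shape; model, optimizer, dataloader, and compute_loss are stand-ins for your own objects.

```python
# Sketch of an Accelerate-based training loop; accelerate launch starts one process per GPU.
from accelerate import Accelerator

accelerator = Accelerator()                         # reads the config created by `accelerate config`

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:                            # each process automatically gets its own shard
    loss = compute_loss(model, batch)               # placeholder for the diffusion training loss
    accelerator.backward(loss)                      # handles gradient sync (and mixed precision if enabled)
    optimizer.step()
    optimizer.zero_grad()
```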

One thing to remember is to set your batch size such that each GPU gets a reasonable amount of data. If you had batch_size 32 on 1 GPU, with 4 GPUs you might keep batch_size 32 per GPU, giving an effective global batch of 128. Some scripts instead treat the batch size you pass as the global batch and split it across GPUs, so read the training script’s documentation to know which convention it uses.
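As a quick illustration of the arithmetic (the numbers here are made up, not a recommendation):

```python
# Effective (global) batch size under data parallelism.
num_gpus = 4
per_gpu_batch_size = 32          # what you pass to the script, if it treats batch size as per-GPU
gradient_accumulation_steps = 1

effective_batch_size = num_gpus * per_gpu_batch_size * gradient_accumulation_steps
print(effective_batch_size)      # 128 images contribute to each optimizer update
```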

Option 2: Multiple single-GPU instances – This is multi-node training. It’s more advanced because GPUs on different machines need to talk over the network. Runpod does support networking (especially if you use their Community Cloud or if you set up a VPN between pods). However, setting up true multi-node distributed training is complex: you need to specify master addresses, ensure low network latency, and so on. For most users, I’d recommend sticking to a multi-GPU single instance if possible, as it’s simpler. If you did need multi-node (say you want 8 GPUs but Runpod only offers 4 GPUs per machine, so you use 2 machines), you would use PyTorch’s distributed launch with the appropriate --nnodes and master address arguments. It can be done, but overhead will be higher: unless those machines have InfiniBand, gradient sync over the network will be slower. Multi-node is only really needed for extremely large or heavy jobs.

Runpod specifics: The nice thing is you’re not locked in – you can choose whatever instance type fits your budget. If you want to fine-tune Stable Diffusion quickly, a 2×A100-40GB instance is great. On a budget, maybe 2×RTX 3090. Check whether Runpod has any preset images for Stable Diffusion training. Community images (like one with Automatic1111) are usually geared for inference, but you might adapt them for training too since they have the dependencies. Alternatively, use a PyTorch base image and install the training libraries (diffusers, etc.).

Handling Bottlenecks and Tips

Data Loading: When using multiple GPUs, each GPU process loads its own portion of the data. Make sure your data pipeline can feed them all efficiently: if reading from disk, use multiple dataloader workers or cache the dataset, and a fast cloud disk helps. You don’t want GPUs sitting idle waiting on data.
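If you’re writing the loop yourself with raw DDP (Accelerate and Lightning do this part for you), a DistributedSampler is the usual way to make sure each GPU process reads only its shard of the dataset. train_dataset, per_gpu_batch_size, and num_epochs below are assumed to come from your own setup:

```python
# Per-rank data sharding with DistributedSampler.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)   # splits indices across ranks
loader = DataLoader(
    train_dataset,
    batch_size=per_gpu_batch_size,   # this is the per-GPU batch size
    sampler=sampler,
    num_workers=4,                   # parallel workers so the GPU isn't waiting on disk
    pin_memory=True,
    persistent_workers=True,
)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)         # reshuffles differently each epoch
    for batch in loader:
        ...                          # your training step
```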

Synchronization overhead: There is a communication cost each step (to sync gradients). On a single machine with NVLink, this is pretty fast, but it’s not zero. That’s why sometimes you don’t get a full 2× speed with 2 GPUs – maybe more like 1.8×. Still worth it. You can amortize overhead by using larger batch per GPU if memory allows (so fewer syncs relative to work done).

Precision and Memory: Use mixed-precision (FP16 or BF16) training to save memory – this is pretty much standard for Stable Diffusion fine-tuning. It frees up room for larger batches and/or higher resolutions and makes use of the GPUs’ tensor cores. All modern GPUs support this well.
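In a hand-rolled PyTorch loop, mixed precision looks roughly like this (with Accelerate you’d get the same effect by passing mixed_precision="fp16" to Accelerator; compute_loss, model, optimizer, and loader are placeholders for your own code):

```python
# FP16 training step with PyTorch automatic mixed precision (AMP).
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = compute_loss(model, batch)   # placeholder for the diffusion loss
    scaler.scale(loss).backward()           # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```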

Monitoring: Keep an eye on GPU utilization (you can run nvidia-smi in watch mode). Ideally, all GPUs should be similarly loaded. If one GPU is doing more work than others (unlikely in a well-set DDP, but possible if data is uneven), that could slow the whole process (since sync waits for the slowest). Also monitor VRAM usage on each – they should be roughly the same.

DeepSpeed/FSDP: There are advanced tools like DeepSpeed and Fully Sharded Data Parallel (FSDP) that can help with really large model training by sharding weights and optimizer states across GPUs. For Stable Diffusion fine-tuning, these are usually unnecessary complexity. But if you attempted to train from scratch at very high resolution, you could look into them. The Puget Systems experiment mentioned they tried DeepSpeed and FSDP but didn’t get either working easily – a reminder that sometimes simpler (DDP) is better, unless you truly need those optimizations.

Example: Fine-Tuning with Multiple GPUs

Let’s say you have a dataset of 5,000 art images and you want to fine-tune Stable Diffusion on that style. On a single GPU, maybe it takes 5 hours for an epoch. You decide to use 2 GPUs to cut that down. Here’s how that might go:

  • You launch a Runpod instance with 2× NVIDIA 3090s.
  • You prepare your training script (for example, the train_text_to_image.py script provided by Hugging Face’s diffusers library, which uses Accelerate for multi-GPU support).
  • You run accelerate config to set up the distributed config (it’ll ask how many GPUs, etc.), then launch training with accelerate launch train_text_to_image.py ....
  • The script automatically splits the batch and uses both GPUs. You monitor and find each GPU is ~90% utilized, memory ~10GB/24GB, which is good.
  • After training, you get your fine-tuned weights in close to half the time it would have taken on one GPU. You shut down the instance to stop billing.
  • If you had noticed a slowdown, you might realize you need to adjust max_train_steps, because with 2 GPUs you were effectively doing double the steps. Many scripts handle this by default, but it’s something to check (the Puget blog highlighted how not adjusting epochs can double the compute). Usually, you specify a number of steps or a learning rate schedule that isn’t dependent on GPU count – the quick check below shows how steps per epoch change with GPU count.
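Here’s that quick check, with illustrative numbers (the 5,000 images come from the example above; the batch size is assumed). More GPUs mean fewer optimizer steps per epoch, so a fixed max_train_steps covers more epochs unless you adjust it:

```python
# How steps per epoch shrink as GPU count grows (data parallel, fixed per-GPU batch size).
dataset_size = 5000
per_gpu_batch_size = 8           # assumed for illustration

for num_gpus in (1, 2, 4):
    steps_per_epoch = dataset_size // (per_gpu_batch_size * num_gpus)
    print(f"{num_gpus} GPU(s): {steps_per_epoch} steps per epoch")
```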

Use Case: Training Stable Diffusion from Scratch

This is extremely ambitious (the original Stable Diffusion was trained on 256 GPUs!). If you’re just curious: training a model as large as SD from scratch would require multi-node training, lots of GPUs, and a huge dataset (LAION-scale image data). It’s probably not feasible for an individual. However, you might do a smaller-scale version (like training a diffusion model on a narrower image domain). Multi-GPU helps, but expect diminishing returns beyond a point if you’re not on proper cluster infrastructure.

For most people, fine-tuning or LoRA training is the target, and multi-GPU can help there too if the dataset is large or you want to iterate quickly on experiments.

Saving and Checkpointing

When training on multiple GPUs, ensure your code is saving checkpoints in a way that’s not duplicated or conflicting. Often one process is rank 0 and will do the saving. Just be aware so you don’t end up with multiple processes trying to write the same file (most libraries handle this by only letting one rank save). On Runpod, save to the persistent storage (if you have one mounted) or download the model after training.
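A common guard in raw DDP code looks like the sketch below; Accelerate exposes the same idea as accelerator.is_main_process, and Lightning handles it internally.

```python
# Save a checkpoint from rank 0 only, so multiple processes don't write the same file.
import torch
import torch.distributed as dist

def save_checkpoint(model, path):
    if dist.get_rank() == 0:                            # only the main process writes to disk
        torch.save(model.module.state_dict(), path)     # .module unwraps the DDP wrapper
    dist.barrier()                                      # other ranks wait until the save finishes
```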

FAQ: Multi-GPU Stable Diffusion Training

Q: Do I need multiple GPUs to fine-tune Stable Diffusion, or can I just use one?

A: You can absolutely use one GPU for fine-tuning. Many people fine-tune Stable Diffusion on a single 8GB or 16GB GPU by using smaller batch sizes and techniques like gradient accumulation. Multi-GPU is not a requirement; it’s an optimization for speed. If your hardware/time budget allows one GPU, you can still get the job done. It just might take longer. Multi-GPU comes into play when you have lots of data or want to try many iterations quickly. For instance, if you’re doing a personal project to train a style, one GPU overnight might be fine. If you’re a researcher experimenting with many runs or a professional needing results faster, multi-GPU helps. On Runpod, if cost is a concern, you could even consider using one GPU but a very powerful one (like an RTX 4090) to approach the performance of 2 lower GPUs. It really depends on your scale.

Q: How do I launch a multi-GPU training job on Runpod – can I do it via the API/CLI?

A: Yes. If you prefer not to click around the console, Runpod’s API lets you request instances of a certain type and quantity. You could request a specific machine type with 2 or 4 GPUs via the API. After that, you could either SSH in or use an entrypoint script to automatically start training. Another approach is to use Docker: build an image that, on container start, checks how many GPUs are available (through environment variables or nvidia-smi) and launches the training accordingly (sketched below). Using the API, you’d then stream logs or have the job save outputs to persistent storage. This is more advanced automation. For many users, just launching a pod with multiple GPUs and running the commands in a Jupyter notebook or terminal is sufficient. But automation is worthwhile for repeated jobs – e.g., if you wanted to run multiple experiments on separate pods in parallel.
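A hypothetical entrypoint for that Docker approach might look like this – the script name train.py is a placeholder, and the GPU count is detected at container start:

```python
# entrypoint.py: detect the pod's GPU count and launch training across all of them.
import subprocess
import torch

num_gpus = torch.cuda.device_count()

if num_gpus > 1:
    cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "train.py"]
else:
    cmd = ["python", "train.py"]

subprocess.run(cmd, check=True)
```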

Q: What instance types should I choose on Runpod for multi-GPU?

A: It depends on availability and your needs:

  • For 2 GPUs: Possibly two RTX 3090s or 4090s is a great setup for SD fine-tuning. Or two A5000s/A6000s (if available). If you need more memory, two A100s (40GB) are excellent but pricier.
  • For 4 GPUs: Some community providers on Runpod might offer 4×RTX 3080 or 4×3090 rigs. These can be cost-effective and strong. 4 GPUs would really speed things up, but ensure your training script and hyperparams are tuned for that scale (you might be doing huge batches).
  • For 1 GPU vs 2 GPU: sometimes it’s better to get a single GPU with more memory than 2 smaller ones, if your bottleneck is memory. For example, if training SDXL, a single A100 80GB might be better than 2×3090 (24GB each) because SDXL might need more than 24GB for a good batch size or else you’re stuck with small batches that scale poorly. Evaluate if you are memory-limited or compute-limited.

Runpod’s pricing can guide you: check the $/hour for a given multi-GPU instance and the expected speed-up. If two 3090s cost exactly 2× a single 3090, and give you ~1.8× speed, that’s slightly more cost per training job, but maybe worth the time saved. If they cost less than 2× (maybe some discount for multi-GPU usage), then it’s a win for both time and cost in the long run.
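As a back-of-the-envelope illustration (the prices here are made up, not Runpod’s actual rates):

```python
# Cost vs. time for 1 GPU vs. 2 GPUs at ~1.8x scaling.
single_gpu_price = 0.50              # $/hr, assumed
dual_gpu_price = 2 * single_gpu_price
single_gpu_hours = 10.0              # assumed single-GPU training time
speedup = 1.8

dual_gpu_hours = single_gpu_hours / speedup
print(f"1 GPU : ${single_gpu_price * single_gpu_hours:.2f} over {single_gpu_hours:.1f} h")
print(f"2 GPUs: ${dual_gpu_price * dual_gpu_hours:.2f} over {dual_gpu_hours:.1f} h")
```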

Q: My multi-GPU training isn’t faster (or not much faster) than single – what could be wrong?

A: A few troubleshooting tips:

  • Check that the workload is evenly distributed. If one GPU is doing all the work (e.g., if you forgot to use DDP and it’s just using GPU0 only), obviously you’d see no speedup. Make sure you launched the script with the correct multi-GPU command.
  • Communication overhead might be dominating. If your batch per GPU is too small, the GPUs spend more time syncing than computing. Solution: increase batch size per GPU if possible, so each does more work per step. Also ensure you’re using efficient backend (NCCL is default for PyTorch – it’s optimized).
  • If you are using an older GPU architecture with slower interconnect (PCIe) and huge gradients, maybe that’s limiting. Usually not an issue with 2 GPUs, but with 4 or more on heavy tasks, bus saturation could occur. NVLink bridges or using GPUs with NVLink (like A6000 with NVLink) can help. On cloud, you might not control that aside from picking certain instance types.
  • Also verify that your evaluation or saving checkpoints isn’t running on a single GPU and becoming a bottleneck. Sometimes after each epoch, if one GPU is doing validation on the whole model or saving, others idle. It’s minor usually, but could appear as slow scaling.
  • If using multiple nodes (multi-machine), network latency and bandwidth could severely hurt performance if not on a high-speed network. That’s why single-machine multi-GPU is strongly preferred for training stable diffusion unless you have InfiniBand networks.
  • Finally, as mentioned, ensure you didn’t accidentally double the work: if training takes the same wall-clock time but each epoch now processes twice the data, you actually did get a speedup – you just miscounted epochs. Always measure in samples per second, or in how long it takes to process a fixed number of images, not time per epoch if the epoch definition changed (the sketch below shows one way to measure this).
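One simple way to track throughput, assuming per_gpu_batch_size, num_gpus, loader, and train_step come from your own training loop:

```python
# Track images/second so 1-GPU and multi-GPU runs can be compared fairly.
import time

start = time.time()
images_seen = 0

for step, batch in enumerate(loader, start=1):
    train_step(batch)                               # placeholder: forward, backward, optimizer step
    images_seen += per_gpu_batch_size * num_gpus
    if step % 50 == 0:                              # report every 50 steps
        print(f"{images_seen / (time.time() - start):.1f} images/sec")
```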

Q: Can I do distributed DreamBooth or LoRA training?

A: Yes, though often DreamBooth (which fine-tunes on like <100 images) doesn’t need multi-GPU because it’s a small dataset. You could, but it might be overkill – the overhead of multi-GPU might not be worth it when you can finish a DreamBooth in like 10-20 minutes on one GPU. LoRA fine-tuning on a larger set (like fine-tuning SD on a new art style with hundreds/thousands of images) could benefit from multi-GPU to go through the images faster. The process is the same – it’s just regular training under the hood (just possibly freezing some weights or applying LoRA adapters). If using the kohya scripts for LoRA, they do support DDP. People have done LoRA training on multi-GPU for speed, especially when doing many fine-tunings.

Build what’s next.

The most cost-effective platform for building, training, and scaling machine learning models—ready when you are.