Emmett Fear

Optimizing Docker Setup for PyTorch Training with CUDA 12.8 and Python 3.11

Intermediate AI developers training large language models (LLMs) need a robust Docker environment for GPU-accelerated workloads. This guide walks through setting up a Docker container with CUDA 12.8 and Python 3.11 for PyTorch and Hugging Face Transformers, optimized for multi-GPU LLM training on Runpod (both Secure Cloud and Community Cloud). We’ll cover choosing a base image, building a Dockerfile, runtime configuration for multi-GPU, testing, deploying on Runpod with persistent storage, and an FAQ addressing common concerns.

Compatible Base Images

When building a Docker image for PyTorch with CUDA 12.8 and Python 3.11, start with an Ubuntu-based image that includes the correct CUDA libraries. NVIDIA provides official CUDA images that are well-tested for GPU workloads:

  • NVIDIA CUDA Runtime – e.g. nvidia/cuda:12.8.0-cudnn-runtime-ubuntu22.04 (Ubuntu 22.04 with CUDA 12.8 and cuDNN). This has the CUDA runtime and cuDNN libraries needed for PyTorch.
  • NVIDIA CUDA Developer – e.g. nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04. Includes the full CUDA toolkit (compiler and headers). Use this if you plan to compile custom CUDA kernels or libraries (e.g. certain LLM optimization libraries) inside the container.
  • PyTorch Pre-built Image – PyTorch offers images on Docker Hub with specific CUDA versions. For example, pytorch/pytorch:2.7.0-cuda12.8-cudnn9-devel comes with PyTorch 2.7.0, CUDA 12.8, and cuDNN 9 pre-installed. These can save time, but check which Python version the image ships (you may need to install Python 3.11 if it isn’t included).

For most cases, an Ubuntu 22.04 base with CUDA 12.8 is a solid foundation. Ubuntu 22.04 doesn’t ship with Python 3.11 by default, but we can install it manually (shown in the Dockerfile below). Ubuntu 24.04 images are also available (with CUDA 12.8 on Ubuntu 24.04), but those default to Python 3.12, which doesn’t match the Python 3.11 target of this guide. Sticking to Ubuntu 22.04 ensures broad compatibility and lets us add Python 3.11 easily.

Why Ubuntu-based? Ubuntu is widely used for AI development, and many PyTorch/LLM library binaries are tested on Ubuntu. Using an Ubuntu base image with the matching CUDA libraries ensures compatibility with NVIDIA GPUs on Runpod. The GPU driver itself lives on the host and is exposed to the container by the NVIDIA container runtime; the CUDA base image supplies the user-space CUDA libraries that sit on top of it. Just make sure the host’s driver supports CUDA 12.8, which is the case on Runpod for modern GPUs.

Dockerfile Example

Below is a Dockerfile that sets up CUDA 12.8, Python 3.11, PyTorch (with appropriate CUDA support for LLM training), and Hugging Face Transformers. This Dockerfile is optimized for reproducibility and performance:

# Use NVIDIA's official CUDA 12.8 base image with cuDNN on Ubuntu 22.04
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04

# Set non-interactive mode for apt and install basic dependencies
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
software-properties-common build-essential wget curl git ca-certificates && \
add-apt-repository ppa:deadsnakes/ppa && apt-get update && \
apt-get install -y --no-install-recommends python3.11 python3.11-dev python3.11-distutils && \
apt-get clean && rm -rf /var/lib/apt/lists/*

# Install pip for Python 3.11
RUN wget https://bootstrap.pypa.io/get-pip.py && python3.11 get-pip.py && rm get-pip.py

# Install PyTorch, Torchvision, Torchaudio with CUDA 12.8 support, and Transformers
ENV TORCH_VERSION=2.7.0
RUN python3.11 -m pip install --no-cache-dir torch==${TORCH_VERSION} torchvision==0.22.0 torchaudio==2.7.0 \
--index-url https://download.pytorch.org/whl/cu128 && \
python3.11 -m pip install --no-cache-dir transformers==4.34.0

# (Optional) Install additional LLM support libraries
# RUN python3.11 -m pip install accelerate bitsandbytes datasets

# Set Python3.11 as the default python
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1

# Create a workspace directory (for persistent volume mount) and set it as working dir
RUN mkdir /workspace
WORKDIR /workspace

# Default command to keep container alive (so we can exec into it or run commands)
CMD ["sleep", "infinity"]

About this Dockerfile:

  • Base Image: We use nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04 which already contains CUDA 12.8 and cuDNN libraries on Ubuntu 22.04. The devel image includes developer tools (useful for any compilation during pip installs). If you prefer a smaller image and are only using pre-built binaries, the runtime image (...-runtime-ubuntu22.04) could be used instead to slim down size.
  • Python 3.11 Installation: We add the deadsnakes PPA and install python3.11 and related packages. This ensures our container uses Python 3.11. We then install pip for Python 3.11. (Ubuntu’s default was 3.10, so this step upgrades us to 3.11.)
  • PyTorch and CUDA: We pin a specific PyTorch version (2.7.0 in this example) that supports CUDA 12.8, installing from PyTorch’s CUDA 12.8 wheel index (--index-url https://download.pytorch.org/whl/cu128). This ensures we get a PyTorch build with CUDA 12.8 support (which includes NVIDIA’s Blackwell architecture support for RTX 50xx GPUs, etc.). The torch, torchvision, and torchaudio versions must be mutually compatible – consult the compatibility matrix on the PyTorch website when picking versions.
  • Hugging Face Transformers: We install a specific Transformers version (e.g. 4.34.0). Pinning the version helps with reproducibility. Transformers is tested on Python 3.9+ and PyTorch 2.0+, so it’s compatible with Python 3.11 and our PyTorch version.
  • Extra LLM Libraries (Optional): We comment-out an example of installing Hugging Face Accelerate, BitsAndBytes, Datasets, etc. These can be included if needed for training optimization (Accelerate for easier multi-GPU launching, BitsAndBytes for 8-bit inference/training, etc.).
  • Set Python3.11 as default: We update alternatives so that running python3 uses Python 3.11. This is convenient if some tools expect python3 on PATH.
  • Workspace Directory: Create a /workspace directory and set it as WORKDIR. On Runpod, we will mount a persistent volume to this path to save training data and checkpoints.
  • Default Command: The container runs sleep infinity by default. This is a simple way to keep the container running (especially in Runpod) without exiting immediately. We can then exec into the running container (or connect via SSH/Jupyter on Runpod) to run training scripts. Alternatively, you could set the CMD to launch a Jupyter Notebook or an SSH server if desired, but on Runpod the platform can inject those via the template (for example, official Runpod templates often open Jupyter Lab on startup).
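
To build the image locally (assuming the Dockerfile above is saved as Dockerfile in your current directory), run:

docker build -t my-llm-image:latest .

The my-llm-image:latest tag is reused in the run and deployment examples later in this guide.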

Note on image size: The above image will include CUDA, PyTorch, and more, which can be several GB. To optimize size, consider using a multi-stage build: one stage to install and build any binaries, and a final stage FROM a lighter base (like the runtime image) copying only necessary files. Also, removing package caches (apt-get clean, rm -rf /var/lib/apt/lists) and not installing unnecessary locales or docs will help keep the image lean.
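
As a rough illustration, here is a minimal sketch of that multi-stage pattern using the same package set as above: install everything into a virtualenv in a devel stage, then copy only that virtualenv into the slimmer runtime image. Treat it as a starting point and verify that the runtime stage still carries any system libraries your wheels expect.

# --- Stage 1: install Python 3.11 and packages into a virtualenv ---
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
software-properties-common build-essential ca-certificates && \
add-apt-repository ppa:deadsnakes/ppa && apt-get update && \
apt-get install -y --no-install-recommends python3.11 python3.11-dev python3.11-venv && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN python3.11 -m venv /opt/venv && \
/opt/venv/bin/pip install --no-cache-dir torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
--index-url https://download.pytorch.org/whl/cu128 && \
/opt/venv/bin/pip install --no-cache-dir transformers==4.34.0

# --- Stage 2: copy only the virtualenv into the smaller runtime image ---
FROM nvidia/cuda:12.8.0-cudnn-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa && apt-get update && \
apt-get install -y --no-install-recommends python3.11 && \
apt-get clean && rm -rf /var/lib/apt/lists/*
COPY --from=builder /opt/venv /opt/venv
ENV PATH=/opt/venv/bin:$PATH
WORKDIR /workspace
CMD ["sleep", "infinity"]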

Runtime Configuration for Multi-GPU Training

With the Docker image built, it’s crucial to launch it with the correct settings so that all GPUs are accessible and communication libraries (like NCCL) work properly:

  • Docker CLI (local or other platforms): Use the --gpus all flag to allocate all available GPUs to the container. For example:

docker run --gpus all -it --shm-size=1g -p 8888:8888 my-llm-image:latest

  • This command gives the container access to all host GPUs. We also set --shm-size=1g as an example to increase shared memory, which can benefit PyTorch’s multi-process data loaders and NCCL. You might adjust this (or use --ipc=host) for large datasets to avoid shared memory bottlenecks.
  • NCCL and Multi-GPU: PyTorch uses NCCL as the default backend for distributed training (multiple GPUs). NVIDIA’s CUDA images already include NCCL, and the PyTorch we installed is built with NCCL support. Typically, no additional configuration is needed for single-node, multi-GPU training – PyTorch will detect multiple GPUs and you can use DataParallel or DistributedDataParallel. If launching a PyTorch distributed training job manually, you might use torchrun or python -m torch.distributed.launch to spawn one process per GPU. Set NCCL_SOCKET_IFNAME (for multi-node) or other NCCL tuning variables only if needed; for single-node training it usually works out of the box. You can set NCCL_DEBUG=INFO as an environment variable to troubleshoot any multi-GPU communication issues.
  • Community Cloud vs Secure Cloud on Runpod: On Runpod, when you choose a multi-GPU instance (e.g., an 8×A100 pod), the platform will automatically make all those GPUs available to your Docker container. In Secure Cloud, GPUs are in datacenter servers and fully support multi-GPU and NVLink/NCCL operations. In Community Cloud, you’re running on a peer provider’s machine – multi-GPU is supported if the host has multiple GPUs, but the availability of very large multi-GPU rigs might be more limited. Either way, Runpod handles the --gpus all equivalent for you when scheduling the pod. You just need to design your training script to utilize multiple GPUs (using PyTorch DDP, Hugging Face Accelerate, etc.). Both Secure and Community Cloud support NCCL for intra-node GPU communication (they use recent NVIDIA drivers), so multi-GPU training within one pod works the same on both.
  • Environment Variables for Performance: You might consider setting some env vars in your Dockerfile or at runtime for optimal performance (an example docker run invocation follows this list):
    • OMP_NUM_THREADS=1 (often set to avoid excess CPU threading overhead with PyTorch).
    • NCCL_P2P_DISABLE=0 / NCCL_IB_DISABLE=0 to ensure peer-to-peer and InfiniBand (if applicable on high-end instances) are enabled for NCCL. On Runpod’s single-node pods, these are usually fine at defaults.
    • TORCH_CUDA_ARCH_LIST if you want to limit compiled kernels to certain GPU architectures (not usually needed unless you want to reduce binary size).
    • In Secure Cloud pods, CUDA_VISIBLE_DEVICES is usually not needed (all GPUs are visible). If you ever want to restrict which GPUs a process uses, you can set CUDA_VISIBLE_DEVICES inside the container.
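
For example, a local docker run that sets a couple of these variables might look like this (the values shown are just illustrative defaults):

docker run --gpus all --ipc=host -e OMP_NUM_THREADS=1 -e NCCL_DEBUG=INFO -it my-llm-image:latest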

In summary, launch the container with GPU access enabled and adequate shared memory. On Runpod, simply selecting the desired number of GPUs for your pod will ensure the container sees all of them, and our Docker image (with CUDA and NCCL) is prepared to use them.
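
To make the multi-GPU pieces concrete, here is a minimal sketch of a single-node DistributedDataParallel script. The file name (minimal_ddp.py) and the tiny linear model are placeholders for your real training code:

# minimal_ddp.py -- single-node multi-GPU sketch using DDP over NCCL
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE and the rendezvous address for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(1024, 1024).to(local_rank)   # stand-in for your real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()            # DDP all-reduces gradients across GPUs here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch it with torchrun --nproc_per_node=<number of GPUs> minimal_ddp.py; torchrun spawns one process per GPU and sets the environment variables the script reads.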

Testing and Validation Scripts

After building your Docker image (and running the container), it’s wise to verify that everything is set up correctly. Here are a few simple tests and commands you can run to validate CUDA and PyTorch inside the container:

  1. Check NVIDIA GPU Visibility: Run nvidia-smi inside the container to ensure the NVIDIA driver is accessible. For example:

docker run --rm --gpus all my-llm-image:latest nvidia-smi

  This should display the GPUs and their memory usage. Note that nvidia-smi comes from the host’s NVIDIA driver and is mounted into the container by the NVIDIA container runtime; on Runpod this happens automatically. You can also check the Metrics tab of your pod for GPU utilization.
  2. Verify PyTorch sees CUDA: Execute a small Python test:

docker exec -it <container_id> python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count()); print(torch.cuda.get_device_name(0))"

  This should print True for CUDA availability, the number of GPUs (device count), and the name of GPU 0. For example, it might output True 4 and NVIDIA A100-SXM4-40GB if you have four A100s attached. This confirms that PyTorch can interface with the CUDA driver.
  3. Test a CUDA operation: You can run a quick tensor operation on the GPU:

import torch
x = torch.rand(5, device='cuda')
print("CUDA Tensor:", x)

  Running the above in the container (e.g. via python3 or a Jupyter notebook on the pod) should succeed without error and indicate that computations are happening on the GPU.
  4. Check NCCL availability: In a Python session, run:

import torch
print("NCCL Available:", torch.distributed.is_nccl_available())

  It should print True. (You don’t need to initialize a process group for this check, just ensure the library is there.) This means NCCL is ready for multi-GPU communication. You could further test by initializing a single-process group – try torch.distributed.init_process_group("nccl", rank=0, world_size=1) and confirm it returns without hanging. Since the default env:// init method reads the rendezvous address from the environment, set MASTER_ADDR=localhost and MASTER_PORT (e.g. 29500) first.
  5. Hugging Face Transformers quick check: Test that the Transformers library works and can see the GPU:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
print("Model loaded on GPU")

  This will download a small model (GPT-2) and move it to the GPU, validating that Transformers is installed and working with PyTorch. (Make sure internet is enabled or the model is available locally; on Runpod Secure Cloud, internet access is enabled by default unless restricted.)
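
For convenience, checks 2–4 above can be rolled into one script that you keep on the volume (the file name sanity_check.py is just a suggestion):

# sanity_check.py -- quick CUDA / PyTorch / NCCL validation
import torch
import torch.distributed

assert torch.cuda.is_available(), "CUDA not available - check GPU allocation and drivers"
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
print(f"GPUs visible: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  [{i}] {torch.cuda.get_device_name(i)}")

# run a small operation on the GPU
x = torch.rand(1024, 1024, device="cuda")
print("Matmul OK:", bool((x @ x).sum().item()))

# NCCL presence (no process group needed for this check)
print("NCCL available:", torch.distributed.is_nccl_available())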

Running these tests on both a local Docker environment and on a Runpod pod after deployment is a good idea. They ensure that CUDA 12.8 is functioning, PyTorch is using the GPU, and NCCL is ready for multi-GPU training. Catching any issues (like a missing library or wrong version) early will save a lot of headache before you launch a multi-hour training job.

Deploying on Runpod (Secure Cloud & Community Cloud)

Deploying your custom Docker image on Runpod is straightforward. We will outline the steps to get your container running on a GPU pod, and how to attach a persistent volume for training data and checkpoints. We’ll note any differences between Secure and Community Cloud as needed.

1. Push Your Docker Image to a Registry

First, ensure your image is accessible to Runpod. If you built my-llm-image:latest locally, tag and push it to a container registry like Docker Hub or GitHub Container Registry. For example, on Docker Hub:

docker tag my-llm-image:latest <your_dockerhub_username>/my-llm-image:latest
docker push <your_dockerhub_username>/my-llm-image:latest

Make the image public (or if private, you’ll need to provide Runpod with credentials via a Secret). Runpod can pull from Docker Hub, Google Artifact Registry, AWS ECR, etc., as long as it has access.

2. Create a Persistent Volume (Secure Cloud)

If you plan to save datasets or model checkpoints, using a Network Volume on Runpod is highly recommended (for Secure Cloud pods). A network volume is persistent storage that remains available even if your pod is terminated. This is currently supported on Secure Cloud; Community Cloud pods do not support network volumes in the same way. If you are using Community Cloud, you may skip this and plan to manually download/upload data or keep the pod running to retain data (though that incurs costs).

To create a network volume on Runpod Secure Cloud:

  • Navigate to the Storage or Volumes section in the Runpod web dashboard (under Secure Cloud).
  • Click Create Volume. Give it a name (e.g., llm-data), choose a data center region (ideally the same region as your pod for performance), and specify a size (e.g., 100 GB for plenty of space).
  • Once created, note the volume name. You will mount it to /workspace (or your chosen path) in the container.

Why /workspace? By convention, Runpod’s official templates use /workspace as the mount point for user data. In our Dockerfile, we set up /workspace for this purpose. Anything written under /workspace will persist on the volume, and you can reuse that data across pod restarts or even attach the volume to a different pod in the same region.

3. Launch a Pod with Your Custom Image

Now, deploy your container on a Runpod GPU instance:

  • Go to the Pods page on Runpod and click Deploy (choose Secure Cloud or Community Cloud depending on your choice). If you want to use the volume created, ensure Secure Cloud is selected since volumes are available there.
  • Select GPU Type & Count: Choose a GPU instance that suits your training (e.g., an A100 or H100 for large models, or an RTX 4090 for smaller experiments). You can filter GPUs by VRAM in the UI. Also select the number of GPUs (e.g., 4 for a 4×GPU pod) – note that Community Cloud might have fewer multi-GPU options.
  • Container Image / Template: In the deployment form, you need to specify the container image. There are two ways:
    • Use a Custom Template: Click on Templates and create a new template. Provide a name (e.g., “LLM Training PyTorch CUDA12.8”), and in the Image Name field, enter your image (e.g., docker.io/<your_dockerhub_username>/my-llm-image:latest). Set the Container Disk Size (this is the ephemeral disk for the container’s OS; e.g., 10 GB is usually enough since our image has everything built in). If you created a network volume, set Volume to the name/ID of the volume or specify the size if prompted, and Volume Mount Path to /workspace. Also, you can specify Ports (e.g., 8888/http if you plan to run Jupyter, or 22/tcp for SSH) and any Environment Variables your container might need. Save the template, then select this custom template for your pod.
    • Quick Deploy with Image: Alternatively, some Runpod interfaces allow directly entering a custom image without making a named template. Look for an option like “Custom Image” and enter the full image name. Then specify the volume mount path and size in the deploy form. (If the UI does not allow direct image entry, use the template method above.)
  • Instance Settings: Choose On-Demand or Spot depending on your budget/need. On-Demand guarantees the pod won’t be interrupted; Spot is cheaper but can be reclaimed (useful only if you can handle interruptions or for non-critical tasks).
  • Deploy: Click Deploy Pod. Runpod will pull your Docker image and start the container. The first time, image pull can take a few minutes if the image is large (it will show a downloading status). Subsequent starts might be faster if the image is cached on the host.

4. Connect and Run Training

Once the pod status is Running, connect to it to use the environment:

  • Runpod offers a Web Terminal and HTTP VNC/SSH options. For example, you might see a “Connect” button, which gives you an in-browser shell into the container. If you set up Jupyter, you could connect via the listed Jupyter HTTP port.
  • Verify that your volume is mounted: in the shell, do df -h or ls /workspace. You should see your volume’s storage available at /workspace (e.g., any files you pre-loaded or an empty directory if new). Remember, anything you put here will persist. For example, if you download a dataset or your model checkpoints into /workspace, they’ll remain even if you stop/terminate the pod, since they live on the network volume.
  • Navigate to /workspace (already the working directory in our Dockerfile) and start your training. You can git clone your training scripts or scp them onto the pod (Runpod also allows file transfer via their CLI or a tool like rsync). Then launch your training Python script as you would locally. If using multiple GPUs, ensure your training script uses DistributedDataParallel or a launcher like torchrun --nproc_per_node=4 train.py to utilize all GPUs.

For example, if you have a training script train.py, you could run:

torchrun --nproc_per_node=$(nvidia-smi -L | wc -l) train.py --output_dir /workspace/checkpoints

This launches one process per GPU (nvidia-smi -L lists the GPUs, so the subshell counts them; you can also hard-code the count, e.g. --nproc_per_node=4). For a simpler approach, the Hugging Face Accelerate library can launch multi-GPU training with accelerate launch, as shown below.
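
For reference, an Accelerate launch on a 4-GPU pod might look like the following (train.py and its --output_dir flag are placeholders for your own script):

accelerate launch --multi_gpu --num_processes 4 --mixed_precision bf16 train.py --output_dir /workspace/checkpoints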

Monitor the GPU usage with nvidia-smi in another terminal or via Runpod’s metrics to ensure all GPUs are busy. Your checkpoints and logs should be written to the volume (/workspace), so they persist.

5. Persisting and Reusing Data

On Secure Cloud, the network volume means you can stop or terminate pods and your data remains in the volume:

  • If you Stop the pod (shuts it down but keeps it attached to your account), you will be charged a small fee for the storage of the container image and volume while it’s stopped. You can later Start it again, and it will come up faster (avoiding image pull if already on disk).
  • If you Terminate the pod (delete it), the network volume will still persist separately. You can then create a new pod later and attach the same volume to it (via template or deploy form) to continue your work. This is great for workflows where you spin up pods only when needed but keep the large data (models, datasets) saved.

On Community Cloud, since network volumes are not available, you have a couple of alternatives:

  • Keep the pod running (or in a stopped state) to retain the data in the container’s disk. Note that stopped pods still incur storage cost as mentioned.
  • Periodically save important data to an external location: for example, upload checkpoints to a cloud storage or to Hugging Face Hub, so you can retrieve them later even if the pod is lost. You could also attach a Runpod Network Volume in Secure Cloud to copy data over (if you have both secure and community resources and want to shuttle data).
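
For the checkpoint-upload option, here is a minimal sketch using the huggingface_hub client (the repo ID is a placeholder, and a Hugging Face token with write access is assumed to be available via the HF_TOKEN environment variable):

# upload_checkpoints.py -- push a checkpoint folder to the Hugging Face Hub
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.create_repo("your-username/my-llm-checkpoints", exist_ok=True)   # hypothetical repo ID
api.upload_folder(
    folder_path="/workspace/checkpoints",        # local checkpoint directory on the pod
    repo_id="your-username/my-llm-checkpoints",
    repo_type="model",
)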

Deploying on Runpod is designed to be flexible. By using custom templates and volumes, you can recreate your training environment consistently. Both Secure and Community Cloud use the same Docker image approach, so you can develop on one and use the other as needed (keeping in mind data persistence differences).

FAQ

Q1: My Docker image is very large. How can I reduce the image size for faster pulls and cold start times on Runpod?

A: Large images (several GBs) can lead to longer startup times, especially on Community Cloud where the host must download the image each time. To optimize:

  • Use a slimmer base image if possible (e.g., the CUDA runtime image instead of devel if you don’t need to compile code). Ensure it still has the libraries you need (cuDNN, NCCL, etc.).
  • Apply multi-stage builds: compile any needed components in one stage, then copy only the necessary runtime files to the final image. This can drastically cut out build tools and dev libraries from the final image.
  • Clean up after installations: remove cache files (e.g., pip cache purge, apt lists as we did in the Dockerfile, etc.). Every layer counts.
  • Consider compressing model files or downloads if they are part of the image, or downloading large models at runtime to a volume instead of baking them in.
  • On Runpod Secure Cloud, once an image is pulled on a particular node, subsequent startups on that same node are quicker (thanks to caching). However, you don’t control which node you get. Using popular base images (like official CUDA or Runpod PyTorch images) increases the chance they’re already cached on the host. In practice, to reduce cold start time, slim the image and only include essentials. Our example image might be ~8-10 GB; with optimizations you could potentially get a lighter image if needed.

Q2: How can I optimize GPU memory usage during LLM training?

A: Training large language models is memory-intensive. Here are tips to make the most of your GPU memory:

  • Use mixed precision training (FP16 or BF16). PyTorch’s torch.cuda.amp.autocast and gradient scaling can allow training with half precision, nearly halving memory usage for model weights and gradients. Many LLM training scripts (and Hugging Face Transformers Trainer or Accelerate) support enabling mixed precision (--fp16 flag, etc.).
  • Enable gradient checkpointing (also known as activation checkpointing) in your training code. This trades computation for memory by not storing all intermediate activations. It’s common to enable this for deep networks to reduce memory footprint.
  • Utilize paging offloading if available: frameworks like Accelerate or DeepSpeed can offload tensors to CPU or NVMe when not in use. This helps if model size is larger than GPU memory, at the cost of speed.
  • Monitor with nvidia-smi to see if memory is the bottleneck. If you hit OOMs, consider reducing the batch size or sequence length. Batch size has a linear effect on memory.
  • Ensure no memory leaks: wrap training loops in try-except to catch OOM, and use torch.cuda.empty_cache() if necessary (though PyTorch usually manages caching well).
  • If you have multiple GPUs, consider model parallelism (sharding the model across GPUs) for extremely large models, or use libraries like Hugging Face’s Accelerate which can automatically shard large models across GPUs.
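
As a minimal sketch of the first two tips (bf16 mixed precision plus gradient checkpointing), using GPT-2 as a small stand-in for a real LLM:

# sketch: bf16 autocast + gradient checkpointing for one training step
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
model.gradient_checkpointing_enable()      # recompute activations in backward to save memory
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("Hello world", return_tensors="pt").to("cuda")
# bf16 autocast requires Ampere or newer GPUs; on older cards use fp16 with torch.cuda.amp.GradScaler
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()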

Q3: Is Python 3.11 fully compatible with ML libraries like PyTorch and Transformers?

A: Yes. By now (2025), Python 3.11 is well supported by major ML frameworks. PyTorch added support for 3.11 in its 2.x releases and provides wheels for it, and Transformers has been tested on Python 3.11 (Transformers supports 3.9+). However, always ensure you install the correct versions of libraries. Some older libraries or custom extensions might have needed updates for 3.11 (due to changes in Python’s internal C API). In our setup:

  • We explicitly installed PyTorch built for CUDA 12.8, which supports Python 3.11 (as confirmed by the PyTorch compatibility documentation and release notes).
  • If you use other libraries (e.g., TensorFlow or JAX), check their Python 3.11 support. TensorFlow has supported Python 3.11 since version 2.12, and recent JAX releases support it as well.
  • One thing to watch: if you use pip to install less common packages, and a binary wheel isn’t available for 3.11, pip might try to compile from source (which can fail if the container lacks build tools or the library’s code isn’t updated for 3.11). Our inclusion of build-essential in the Dockerfile helps in case something needs a quick build. Always refer to official documentation for compatibility matrices (for example, the PyTorch CUDA compatibility matrix lists which Python and CUDA versions each PyTorch release supports).
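
A quick way to confirm the interpreter and library versions inside the container built earlier in this guide:

python3.11 -c "import sys, torch, transformers; print(sys.version.split()[0], torch.__version__, transformers.__version__)"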

Q4: How can I reduce the container cold start time on Runpod, especially for Community Cloud?

A: Cold start time is influenced by image size and network speed. To reduce it:

  • As discussed in Q1, minimize the image size. A smaller image pulls faster. For instance, if your image is 3 GB instead of 10 GB, that’s a big improvement in start time.
  • Choose the right base: If you use an image that is common or official (like runpod/pytorch or nvidia/cuda), it might already reside on the host. Runpod’s Secure Cloud nodes might pre-cache some official images (though not guaranteed).
  • Warm pooling: On Runpod, using On-Demand instances can sometimes result in your pod being scheduled on a machine that you have used recently, meaning the image layers are cached. There’s no user-controlled way to guarantee this, but running a small test pod with the image and then stopping it might “warm” the cache for a subsequent larger deployment (though on Community Cloud, you’re likely to land on a random provider’s machine).
  • Keep your Docker layers efficient: layers that change frequently (like if you rebuild the image with new code) should be toward the end of the Dockerfile. Layers that never change (like installing base packages) should be at the top, so they can be cached. This doesn’t affect pull time from registry much (that’s more about size), but it affects build time and caching between builds.
  • Finally, consider Runpod Serverless for certain use cases. If you need very fast spin-up and spin-down (for inference rather than training), Runpod’s serverless GPU endpoints might be more appropriate. But for training, Pods are usually the way to go despite the cold start time.

Q5: Why can’t I use persistent volumes on Community Cloud pods?

A: Network Volumes (the persistent volumes in Runpod) are a feature tied to Secure Cloud, where Runpod controls the storage in the data center. Community Cloud providers do not currently offer the same network-attached storage. In Community Cloud, the storage is typically just the local disk on the provider’s machine, which is ephemeral. That’s why when a Community pod is destroyed, its data is gone, and you can’t attach a network volume. If persistent storage is important for your workflow (e.g., you don’t want to re-download data each time or lose checkpoints), you might prefer Secure Cloud pods for that capability. Secure Cloud ensures reliability and data persistence by allowing volumes, at a slightly higher cost than Community. One workaround on Community is to use the Runpod Volume Sync feature – you can manually sync a folder to a cloud bucket (AWS S3, GCP, etc.) using the provided integration, or just script your own upload to cloud storage at the end of training.

By following this guide, you should have a tailored Docker environment for PyTorch with CUDA 12.8 and Python 3.11, ready to train LLMs on Runpod. From choosing the right base image to deploying on Secure or Community Cloud with persistent storage, each step is tuned for performance and reproducibility. For more details or troubleshooting, refer to Runpod’s documentation on Dockerfiles, network volumes, and GPU usage to further optimize your workflow. Happy training!
