Deploy FLUX.2 on Runpod: GPU Requirements, ComfyUI, and Production Setup

Quick answer: FLUX.2-dev runs best on a single H100 or A100 80GB pod using the FP8 checkpoint (~32 GB VRAM), with an A40 or another 48 GB card capable of running quantized dev workflows for testing and experimentation. For sub-second generation on cheaper hardware, deploy FLUX.2 klein 9B on an L40S or RTX 4090 pod. Spin up a ComfyUI or PyTorch pod template, pull the weights from Hugging Face, and you can be generating 4 MP images within minutes, paying per second only while the pod is running.

Black Forest Labs released the FLUX.2 family in November 2025, and it has since replaced FLUX.1 as the default choice for production image generation. FLUX.2 is a 32-billion-parameter rectified flow transformer paired with a Mistral-based 24B vision-language text encoder. The practical gains over FLUX.1 are the ones teams actually notice in output: reliable in-image text rendering, resolutions up to 4 megapixels, multi-reference conditioning (up to 10 reference images for consistent characters and styles), and built-in editing in the same model.

The catch is hardware. At full BF16 precision, FLUX.2-dev needs roughly 64 GB of VRAM before text-encoder overhead, out of reach for any consumer card. That makes cloud GPUs the realistic deployment path for most teams, and it's where Runpod's per-second billing and high-VRAM inventory fit.
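The 64 GB figure follows from simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming the weights dominate memory and ignoring activations, the VAE, and the text encoder (the function name is illustrative):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Estimate memory for the weights alone, in decimal gigabytes.

    One billion parameters at one byte each is one gigabyte, so the
    estimate is parameter count (in billions) times bytes per parameter.
    """
    return params_billion * bytes_per_param


# FLUX.2-dev transformer: 32 billion parameters.
bf16_gb = weight_memory_gb(32, 2)  # BF16 stores 2 bytes per parameter
fp8_gb = weight_memory_gb(32, 1)   # FP8 stores 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```

Treat both numbers as lower bounds for the transformer alone; the text encoder and working buffers come on top.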

Which FLUX.2 variant should you deploy?

Rule of thumb: if output quality and text accuracy are the priority (product imagery, ads, anything with typography), deploy dev in FP8. If you're serving a high-volume API where latency and cost-per-image dominate, the klein models, step-distilled to 4 inference steps, are the better fit and need far less VRAM.

What about FLUX.1? (Still searching for the original?)

FLUX.1, the model that put Black Forest Labs on the map in August 2024, is still available and still widely deployed, as are FLUX 1.1 Pro and the Kontext editing models that followed in 2025. If you arrived here looking for a FLUX.1 setup, the good news is that everything below applies with lighter hardware: FLUX.1-dev needs roughly 33 GB of VRAM at FP16 and about 13 GB at FP8, so a single L40S, RTX A6000, or even an RTX 4090 pod on Runpod handles it comfortably, and FLUX.1-schnell runs on cards as small as 8 GB for casual generation. That makes FLUX.1 a sensible budget option for workflows that don't need FLUX.2's text rendering or 4 MP output. Existing LoRAs, fine-tunes, and ComfyUI workflows built on FLUX.1 also remain fully usable. For new projects, though, FLUX.2 klein 4B now occupies the same low-VRAM niche with a newer architecture and an Apache 2.0 license, so we'd recommend FLUX.1 mainly when you're maintaining an existing pipeline or an ecosystem of trained adapters rather than starting fresh.

Matching the model to a Runpod GPU

H100 80GB: the workhorse for FLUX.2-dev FP8. The 32 GB checkpoint fits with ample headroom for the text encoder, VAE, and batch generation at 4 MP.
A100 80GB: runs dev comfortably in FP8; workable in BF16 with text-encoder offloading. Often the best price-per-image for steady production load.
L40S / RTX 6000 Ada (48 GB): runs dev with GGUF Q8 quantization, or klein 9B at full precision with room to spare.
RTX 4090 / 5090 (24–32 GB): klein 9B in FP8, klein 4B natively, or dev via GGUF Q4 (~19 GB) if you accept a modest quality trade-off.

NVIDIA and Black Forest Labs shipped FP8 quantizations at launch that cut VRAM by about 40% at comparable quality, and they're supported out of the box in ComfyUI. So FP8, not full precision, is the sensible default for self-hosting dev.
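The matching above can be written down as a small lookup. This is only a sketch of the guidance in this article, not an official sizing tool; the function name and the exact thresholds are illustrative assumptions taken from the VRAM tiers listed here.

```python
def suggest_flux2_setup(vram_gb: int) -> str:
    """Map a GPU's VRAM (in GB) to the FLUX.2 setup this guide suggests."""
    if vram_gb >= 80:
        return "FLUX.2-dev, FP8 checkpoint"
    if vram_gb >= 48:
        return "FLUX.2-dev via GGUF Q8, or klein 9B at full precision"
    if vram_gb >= 24:
        return "klein 9B in FP8, klein 4B, or FLUX.2-dev via GGUF Q4"
    if vram_gb >= 16:
        return "klein 4B"
    return "below the sizes covered in this guide"


for card, vram in [("H100", 80), ("L40S", 48), ("RTX 4090", 24)]:
    print(f"{card} ({vram} GB): {suggest_flux2_setup(vram)}")
```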

Step-by-step: deploying FLUX.2-dev on Runpod

Create a pod. In the Runpod console, select an H100 or A100 80GB (Secure Cloud for production). Starting with 80 GB gives you enough memory to experiment; once you know what your workload actually needs, you can drop to a lower spec. Attach a volume of 100 GB or more: the dev weights alone are large, and persistent storage means you don't re-download on every pod start.
Pick a template. The fastest route is the official ComfyUI template, which ships with FLUX.2 workflow templates built in, plus the RunpodDirect and Civicomfy plugins to simplify downloads and installation. For a more technical, API-style deployment, use a PyTorch 2.x base image and install diffusers, which supports the FLUX.2-dev checkpoint through its Flux2Pipeline class (FluxPipeline is the FLUX.1 class) and integrates cleanly with FastAPI, batching, and torch.compile.
Pull the weights. Download FLUX.2-dev (or a klein variant) from Black Forest Labs' Hugging Face page to your network volume. Use the FP8 checkpoint unless you have a specific reason to run BF16. FLUX.2-dev is a gated model, so you'll need a Hugging Face access token and must accept the model terms before downloading.
Generate. Dev targets 20–50 inference steps (28 is a common default). The Mistral-based encoder responds best to plain descriptive sentences rather than keyword-soup prompts. Write the prompt as a short paragraph describing composition and lighting, and quote any in-image text verbatim.
Scale. For bursty or production traffic, wrap the pipeline as a Runpod Serverless endpoint. You get scale-to-zero between requests and per-second billing during them, which is usually far cheaper than a 24/7 pod for variable workloads. For batch jobs (e.g., hundreds of product mockups), spot instances cut costs further at the price of possible interruption, so checkpoint outputs to your volume as you go.
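The last two steps can be sketched as a Serverless handler. This is a minimal outline under stated assumptions, not a production worker: the input field names (prompt, steps) and the generate callable are hypothetical, and generate stands in for whatever pipeline call you wire up. The only Runpod SDK call used is runpod.serverless.start, imported lazily so the request logic can be tested without the SDK installed.

```python
from typing import Any, Callable, Dict

DEFAULT_STEPS = 28            # common default for dev, per the step above
MIN_STEPS, MAX_STEPS = 20, 50  # dev's recommended range


def parse_request(job_input: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a request payload and clamp the step count to 20-50."""
    prompt = str(job_input.get("prompt", "")).strip()
    if not prompt:
        raise ValueError("'prompt' is required")
    steps = int(job_input.get("steps", DEFAULT_STEPS))
    steps = max(MIN_STEPS, min(MAX_STEPS, steps))
    return {"prompt": prompt, "num_inference_steps": steps}


def make_handler(generate: Callable[..., str]) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a handler around a hypothetical generate(prompt, num_inference_steps) -> path."""
    def handler(job: Dict[str, Any]) -> Dict[str, Any]:
        try:
            params = parse_request(job.get("input", {}))
        except (TypeError, ValueError) as exc:
            # Report bad input in the result instead of crashing the worker.
            return {"error": str(exc)}
        return {"image_path": generate(**params)}
    return handler


def main(generate: Callable[..., str]) -> None:
    """Container entry point: register the handler with the Runpod SDK."""
    import runpod  # lazy import; requires the runpod package in the image
    runpod.serverless.start({"handler": make_handler(generate)})
```

Load the model once at module import (outside the handler) in a real worker, so warm requests skip the load time, and have generate write outputs to the attached volume.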

Practical tuning notes

FP8 first, GGUF second. FP8 is near-lossless for dev. GGUF Q8 is comparable but slightly larger in VRAM; use Q4/Q5 only when you must fit a 24 GB card. Don't stack FP8 compute on top of GGUF weights: it adds instability for negligible savings.
Use multi-reference for consistency. FLUX.2's multi-reference input (up to 10 images) is the right tool for consistent characters or brand styles across a campaign, far more reliable than prompt engineering alone.
Offload the text encoder if you're VRAM-constrained. The 24B encoder can sit in system RAM between prompt encodings at a minor latency cost.
klein for iteration, dev for finals. A common workflow is prompt-iterating on klein 4B (sub-second feedback), then re-rendering selects on dev.
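Two of these notes translate directly into code: offloading idle components to system RAM, and using different step counts for klein and dev. The sketch below assumes a recent diffusers release that exposes a Flux2Pipeline class for FLUX.2-dev; check your installed version before relying on that name. enable_model_cpu_offload is the standard diffusers method for moving whole components between system RAM and the GPU as they are needed. The helper names are illustrative.

```python
def default_steps(variant: str) -> int:
    """Step count by variant: klein is distilled to 4 steps, dev commonly uses 28."""
    return 4 if variant.lower().startswith("klein") else 28


def load_dev_pipeline(model_path: str, low_vram: bool = True):
    """Load FLUX.2-dev from a local path, optionally with CPU offload.

    Imports are lazy so this module can be imported without a GPU stack.
    """
    import torch
    from diffusers import Flux2Pipeline  # assumption: present in your diffusers version

    pipe = Flux2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    if low_vram:
        # Components (including the text encoder) wait in system RAM
        # and are moved to the GPU only while they run.
        pipe.enable_model_cpu_offload()
    else:
        pipe.to("cuda")
    return pipe
```

For the iterate-then-finalize workflow, call default_steps("klein-4b") while exploring prompts and default_steps("dev") for the final render.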

FAQ

What GPU do I need for FLUX.2 on Runpod? An H100 or A100 80GB for FLUX.2-dev in FP8 (~32 GB VRAM in use). klein 9B fits 24GB cards in FP8; klein 4B fits 16GB cards. Full-precision dev (~64 GB+) effectively requires an 80GB-class card with offloading.

How fast is generation? klein models are distilled to 4 steps and generate in under a second on appropriate hardware. dev at 28 steps typically lands in the seconds-to-tens-of-seconds range per 1024px image, depending on the GPU; 4 MP output takes proportionally longer.

Is FLUX.2 open source? Partially. The klein models are Apache 2.0. FLUX.2-dev is open-weights under a non-commercial license, with commercial licensing available from Black Forest Labs. The pro/max tiers are API-only.

Can I still run FLUX.1? Yes; see the FLUX.1 section above. FLUX.1-dev and FLUX.1-schnell remain available, run on much smaller GPUs, and retain a large ecosystem of LoRAs and workflows, but FLUX.2 is materially better on text rendering, prompt adherence, and editing for new builds.

Should I use Pods or Serverless? Pods for sustained, predictable load or interactive ComfyUI sessions; Serverless for APIs with variable traffic, where scale-to-zero eliminates idle cost.
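A rough break-even calculation makes the choice concrete. The rates below are placeholders chosen for easy arithmetic, not Runpod prices; substitute the current figures for your GPU. The sketch also assumes Serverless bills only for active request time and ignores cold starts.

```python
def pod_cost_per_day(hourly_rate: float) -> float:
    """Cost of keeping one pod running for 24 hours."""
    return hourly_rate * 24


def serverless_cost_per_day(requests: int, seconds_per_request: float,
                            per_second_rate: float) -> float:
    """Cost of serving the same traffic while paying only for active seconds."""
    return requests * seconds_per_request * per_second_rate


def break_even_requests(hourly_rate: float, seconds_per_request: float,
                        per_second_rate: float) -> float:
    """Daily request count at which an always-on pod and Serverless cost the same."""
    return pod_cost_per_day(hourly_rate) / (seconds_per_request * per_second_rate)


# Placeholder rates (hypothetical, for illustration only).
HOURLY_RATE = 3.00        # pod price per hour
PER_SECOND_RATE = 0.001   # serverless price per active second
SECONDS_PER_IMAGE = 10    # assumed generation time per request

threshold = break_even_requests(HOURLY_RATE, SECONDS_PER_IMAGE, PER_SECOND_RATE)
print(f"Break-even at roughly {threshold:.0f} requests per day")
```

Below the break-even volume Serverless is cheaper; well above it, with steady traffic, a pod wins.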


Author profile: Emmett Fear
