What does "serverless inference" actually mean?
At its purest, serverless should be simple: you send us a request, and we respond with a prediction. That's the whole contract. Anything more complicated than that probably isn't something you should have to figure out in your first five minutes on a platform.
The second half of the promise is the auto-scaling guarantee. As your traffic grows, Runpod backs it up with more physical GPUs to process the workload, and when traffic drops, your endpoint scales back down to zero so you're not paying for idle hardware. You're billed per second of actual compute, not for capacity you reserved and never used. This is the key differentiator from a pod, where you just rent the entire GPU and you’re billed for the entire time whether you use it or not.
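To make the billing difference concrete, here is a rough back-of-the-envelope sketch. The hourly rate and the busy hours are hypothetical, and it assumes the same rate for both products, which real pricing may not match.

```python
def serverless_cost(busy_seconds: float, rate_per_second: float) -> float:
    """Cost when you are billed only for seconds of actual compute."""
    return busy_seconds * rate_per_second


def pod_cost(rented_hours: float, rate_per_hour: float) -> float:
    """Cost when the GPU is rented for the whole period, busy or idle."""
    return rented_hours * rate_per_hour


# Hypothetical numbers: a $2.00/hour GPU that is busy 3 hours out of 24.
HOURLY_RATE = 2.00
BUSY_HOURS = 3

serverless = serverless_cost(BUSY_HOURS * 3600, HOURLY_RATE / 3600)
pod = pod_cost(24, HOURLY_RATE)
print(f"serverless: ${serverless:.2f} / day, pod: ${pod:.2f} / day")
```

The gap narrows as utilization climbs, so a workload that keeps a GPU busy around the clock may still be better served by a pod.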
That simple request/response API plus automatic scaling on real GPUs is what makes serverless the fastest path from "I have a model" to "I have a production endpoint." Today more than 750,000 developers build on Runpod, with serverless deployed across 31 regions and billions of inference requests served. Everything below is about making that path faster, cheaper, and able to handle bigger models.
More GPUs, more ways to use them
We're in a historic GPU supply crunch, layered on top of an industry-wide shift from Nvidia's Ada and Hopper generations to Blackwell. We've invested heavily to keep capacity ahead of demand, which has helped us build our infrastructure out to the level it needs to be. But raw capacity isn't the only answer: a lot of the recent work has been about using each GPU more efficiently.
Multi-Instance GPU (MIG): stop paying for compute you don't need
A card like the RTX Pro 6000 is powerful and expensive: it sits in the $10,000–$12,000 range and sports 96 GB of VRAM. The catch is that plenty of workloads don't need all of it. If you're serving a 30-billion-parameter model that fits comfortably in ~30 GB, the rest of that card is just sitting there. The question, then, is how we get that hardware into the hands of developers at a price that's fair and appropriate for what they need.
Multi-Instance GPU lets us split a single professional or enterprise Nvidia card into firmware-isolated instances. We can carve that one card into separate tenancies: you run your model on one slice, and another customer you'll never see runs on another, completely separated, the same way Runpod's container images are isolated from one another.
MIG is an Nvidia capability, not something we invented, and with Blackwell it's become effectively lossless. We can take advantage of it without giving up performance inside the chip. For us, it means we can dynamically configure and reconfigure supply to match real demand throughout the day: more 24 GB, 32 GB, 48 GB, or full 96 GB instances as workloads shift between US peak hours, EU peak hours, and the very different mix that runs overnight. For you, it means right-sized GPUs at a price that reflects what you actually use. This is going to be especially important as the industry shifts away from producing the smaller consumer or prosumer cards that these smaller workloads used to run on. That workload demand hasn’t dried up even though GPU production has shifted gears.
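As a rough illustration of right-sizing, the sketch below estimates a model's weight footprint and picks the smallest instance size that holds it. The slice sizes come from the paragraph above. The 20% headroom for KV cache and activations and both function names are assumptions for illustration, not a Runpod API.

```python
from typing import Optional

# Instance sizes named in the text, in GB of VRAM.
SLICE_SIZES_GB = (24, 32, 48, 96)


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint: parameters x bytes per parameter (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def smallest_fitting_slice(required_gb: float, headroom: float = 1.2) -> Optional[int]:
    """Return the smallest slice that holds required_gb plus headroom, or None if none fits."""
    needed = required_gb * headroom
    for size in SLICE_SIZES_GB:
        if size >= needed:
            return size
    return None


# A 30-billion-parameter model at 8-bit precision (1 byte per parameter).
footprint = weights_gb(30, 1)
print(footprint, smallest_fitting_slice(footprint))
```

Under these assumptions the 30 GB model lands on a 48 GB instance rather than the full 96 GB card. The same model at 16-bit precision would need the whole card.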
Multi-GPU workers for frontier-scale open models
At the other end of the spectrum are the big open-weight models like Kimi K2 or MiniMax 3 that approximate frontier performance and need serious memory, especially at 16-bit precision. For those, a single GPU isn't enough.
Runpod Serverless lets you put up to 8 GPUs behind a single worker, using cards like the H100, H200, B200, and B300, the last of which gives you 288 GB of memory on a single GPU. This lets you pull together the terabytes of memory you'll need to get one of these models off the ground. Keeping all that compute on one physical system means you can run large models without stepping into truly distributed, multi-node inference and the platform complexity that comes with it. Walk before you run.
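The arithmetic behind "terabytes of memory" can be sketched as follows. The 8-GPU limit and the 288 GB figure come from the text. The 1-trillion-parameter model and the 10% headroom factor are hypothetical, chosen only to show the calculation.

```python
import math

MAX_GPUS_PER_WORKER = 8   # limit stated in the text
B300_MEMORY_GB = 288      # per-GPU memory stated in the text


def gpus_needed(params_billions: float, bytes_per_param: float,
                gpu_memory_gb: float, headroom: float = 1.1) -> int:
    """GPUs required to hold the weights plus headroom (1 GB = 1e9 bytes)."""
    needed_gb = params_billions * bytes_per_param * headroom
    return math.ceil(needed_gb / gpu_memory_gb)


def fits_on_one_worker(num_gpus: int) -> bool:
    """True if the model can stay on a single multi-GPU worker."""
    return num_gpus <= MAX_GPUS_PER_WORKER


pooled_gb = MAX_GPUS_PER_WORKER * B300_MEMORY_GB
# Hypothetical 1-trillion-parameter model at 16-bit precision (2 bytes per parameter).
count = gpus_needed(1000, 2, B300_MEMORY_GB)
print(pooled_gb, count, fits_on_one_worker(count))
```

With these inputs the pooled memory is 2,304 GB and the model needs all 8 GPUs, which is still one worker. Anything larger would push you into multi-node territory.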
Killing cold starts
Cold starts are the perennial serverless concern, and they've only gotten harder as models grow. Serving a model means moving gigabytes across the wire, onto the host disk, into system RAM, and finally into VRAM. We've instrumented that entire pipeline ruthlessly, with engineers whose entire job is shrinking each stage. A few of the technologies doing the heavy lifting:
- FlashBoot is our cold-start optimization layer, delivering sub-200ms cold starts on active endpoints at no extra cost. It can page what's in VRAM back to system RAM or save it to NVMe flash storage, which is especially powerful for intermittent workloads. We automatically page things in and out as traffic comes and goes. The more popular your endpoint, the more FlashBoot helps.
- Model caching in network storage lets you download a model once so workers don't re-download weights on every scale-up.
- A container registry that we’re working on puts Runpod in control of distributing the containers and models you depend on, keeping them cached closer to where your workers run, and letting us inject your code into a longer-lived container image instead of rebuilding from scratch on every change.
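To see why caching closer to the GPU matters, here is a simplified model of the pipeline described above. The bandwidth figures are illustrative placeholders, not Runpod measurements, and the sketch assumes each hop runs one after another rather than overlapping.

```python
# Hypothetical sustained bandwidth for each hop, in GB/s.
STAGES = [
    ("network -> host disk", 1.0),
    ("host disk -> system RAM", 5.0),
    ("system RAM -> VRAM", 20.0),
]


def load_seconds(model_gb: float, start_stage: int = 0) -> float:
    """Time to move model_gb through every hop from start_stage onward."""
    return sum(model_gb / gbps for _, gbps in STAGES[start_stage:])


MODEL_GB = 30
print("from the network:", load_seconds(MODEL_GB, 0), "s")
print("already on disk: ", load_seconds(MODEL_GB, 1), "s")
print("already in RAM:  ", load_seconds(MODEL_GB, 2), "s")
```

Whatever the real numbers are, the shape is the same: each tier you can skip removes the slowest remaining hop, which is the idea behind paging weights to system RAM or NVMe instead of fetching them again.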
Two ways to run: real-time and batch
Not every workload has a user tapping their foot. Runpod Serverless supports two distinct patterns, and choosing the right one is mostly about who (or what) is waiting on the other end.
Near-real-time, user-in-the-loop: Someone is waiting for a response right now, whether that's a chat assistant, an image generation, or a recommendation. We've all learned to tolerate a few seconds, but the expectation is "soon." We all know how fickle users can be and how quick they are to go to a competitor if something they're waiting on takes too long to run. This is what most people picture when they think of an inference API.
Batch inference: Here you have a million things to process and no single user waiting. Picture a real-estate platform that wants to auto-HDR every image across 8 million listings. That's a pure throughput job. With batch inference, instead of calling your endpoint as run-sync or run-async, you call it as run-batch: you get a service-level agreement measured in hours, and Runpod schedules the work as capacity frees up. You don't manage workers at all. We might spot 100 idle H100s and run your job there, or slow-burn 8 million requests across a free 8-GPU worker overnight and tell you in the morning what completed and what it cost, billed at the rate you agreed to.
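A quick way to sanity-check whether a job fits inside an hours-long window is to estimate its wall-clock time. The 0.05 seconds per image below is a hypothetical figure, and the sketch assumes work is spread evenly with each GPU handling one item at a time.

```python
def batch_hours(num_items: int, seconds_per_item: float, num_gpus: int) -> float:
    """Wall-clock hours if items are split evenly across num_gpus."""
    return num_items * seconds_per_item / num_gpus / 3600


LISTING_IMAGES = 8_000_000
SECONDS_PER_IMAGE = 0.05  # hypothetical per-image processing time

overnight = batch_hours(LISTING_IMAGES, SECONDS_PER_IMAGE, 8)
burst = batch_hours(LISTING_IMAGES, SECONDS_PER_IMAGE, 100)
print(f"8 GPUs: {overnight:.1f} h, 100 GPUs: {burst:.1f} h")
```

Under these assumptions the same job takes roughly 14 hours on one 8-GPU worker or a little over an hour on 100 idle GPUs, which is why a scheduler that can choose between the two has room to work with.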
The practical win: batch lets us say yes to large throughput jobs without you needing to negotiate capacity in advance.
Deploy without the Docker tax
One of the most consistent pieces of feedback we hear is that the serverless experience isn't quite as frictionless as working in a pod. Writing a Dockerfile, building an image, pushing to a registry, and wiring it all to an endpoint can turn a quick experiment into a multi-hour chore.
Flash removes that tax. Not to be confused with FlashBoot (which is our cold-start optimization layer), Flash is an open-source Python SDK that turns a decorated function into a live, auto-scaling serverless endpoint. No Dockerfile, no container orchestration, just Python. You specify the GPU type and dependencies in a decorator, run your script, and Flash provisions the endpoint, runs your code on the hardware you asked for, and tears it down when the job finishes. You're billed only for execution time, on the same per-second, scale-to-zero economics as the rest of Serverless.
Flash is also built for agentic workloads: Flash Apps combine multiple endpoints with different compute configurations into one deployable service, so an agent's orchestration layer can run on one tier while model inference runs on another.
We're also making Flash a first-class deployment path into Serverless: you can push Python code changes straight onto a container without rebuilding the image every time, and a pre-flight check confirms your build can actually serve inference before you're ever charged for production traffic.
Smarter defaults: model-first deployment and auto-tuning
Most serverless customers bring their own custom AI image, but plenty of people just want to serve a known model well without becoming an inference-engine expert. So we're shipping a model-first deployment flow with pre-tuned configurations that our own ML-ops engineers have benchmarked.
The roadmap goes further: generating configurations tailored to your request shape (input token sizes, output token sizes, and latency-vs-throughput goals) and, ultimately, auto-tuning that we can prove. We can spin up an extra GPU worker, test several configs head-to-head, and show you which one is fastest for your workload. With a finite number of GPUs in the world, density and efficiency aren't just nice to have; they're how everyone gets a better experience.
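The head-to-head idea can be sketched in a few lines. This is not how Runpod's auto-tuning works internally: the config keys, the `pick_fastest` helper, and the stand-in `fake_measure` function are all hypothetical, and a real run would replay your actual request shape against each config.

```python
import statistics
from typing import Callable, Dict, List


def pick_fastest(configs: Dict[str, dict],
                 measure: Callable[[dict], List[float]]) -> str:
    """Return the name of the config with the lowest median latency."""
    medians = {name: statistics.median(measure(cfg)) for name, cfg in configs.items()}
    return min(medians, key=medians.get)


# Hypothetical candidate configurations.
candidates = {
    "small-batch": {"max_batch_size": 4},
    "large-batch": {"max_batch_size": 32},
}


def fake_measure(cfg: dict) -> List[float]:
    """Stand-in for real benchmark runs; returns made-up latencies in seconds."""
    base = 0.8 if cfg["max_batch_size"] == 4 else 1.4
    return [base, base + 0.1, base + 0.2]


winner = pick_fastest(candidates, fake_measure)
print(winner)
```

Median latency suits a latency goal. For a throughput goal you would swap in a different metric, such as tokens per second, and take the maximum instead.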
Storage that follows you
Storage on Runpod is tiered. Container and volume disks live physically on the host. Network storage is a high-performance NVMe storage server co-located in the data center; it gives you portability across that whole region and is where your cached models live. And the newest tier, global storage, uses CDN-style technology so your data isn't pigeonholed inside a single data center.
The payoff: if your workload moves from one data center to another, you don't copy anything. You bind your training data, output data, or anything else your service needs to global storage, and it's simply there, letting you use any GPU in any region without thinking about where the bytes are.
Getting started: deploy your first endpoint
There are two paths, depending on how much control you want over the environment.
Path 1: The classic handler (full control)
Every Serverless endpoint runs a handler function that takes a request and returns a result. The minimal version looks like this:
import runpod

def handler(job):
    job_input = job["input"]
    prompt = job_input.get("prompt", "")
    # Your inference code goes here
    result = run_model(prompt)
    return result

# Start the serverless worker
runpod.serverless.start({"handler": handler})
From there, the workflow is:
- Write your handler and test it locally with the Runpod SDK (pip install runpod).
- Package it in a container image and push it to a registry.
- Create an endpoint in the Runpod console. Pick your GPU (set a priority list so it falls back automatically if your first choice is busy), set min/max workers, and enable FlashBoot to minimize cold starts.
- Send a request to your endpoint URL via run-sync, run-async, or run-batch.
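The last step might look like the sketch below, which uses only the standard library. It assumes the synchronous route is exposed as /runsync under https://api.runpod.ai/v2/<endpoint id> with a bearer-token header, so check the current docs before relying on it. The environment variable names are hypothetical.

```python
import json
import os
import urllib.request


def build_runsync_request(endpoint_id: str, api_key: str,
                          payload: dict) -> urllib.request.Request:
    """Build a POST request that wraps payload in the {"input": ...} envelope."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"  # assumed route
    body = json.dumps({"input": payload}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical environment variable names; nothing is sent unless both are set.
endpoint_id = os.environ.get("RUNPOD_ENDPOINT_ID")
api_key = os.environ.get("RUNPOD_API_KEY")
if endpoint_id and api_key:
    request = build_runsync_request(endpoint_id, api_key, {"prompt": "Hello"})
    with urllib.request.urlopen(request, timeout=120) as response:
        print(json.load(response))
```

The "input" envelope matches what the handler above reads from job["input"], so whatever you put in the payload is what your handler receives.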
Keeping one or two active workers above zero eliminates cold starts for your baseline traffic, while a max-worker cap protects you from runaway spend during spikes.
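The effect of those two bounds can be shown with a deliberately simplified model. This is not Runpod's scaling algorithm: the `target_workers` function and the requests-per-worker figure are assumptions used only to show how a floor and a cap shape the worker count.

```python
import math


def target_workers(pending_requests: int, requests_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Workers wanted for the current load, clamped to the configured bounds."""
    wanted = math.ceil(pending_requests / requests_per_worker)
    return max(min_workers, min(wanted, max_workers))


# Hypothetical settings: 1 active worker as a floor, a cap of 10.
for pending in (0, 25, 400):
    print(pending, "->", target_workers(pending, 4, min_workers=1, max_workers=10))
```

With no traffic the floor keeps one warm worker, moderate traffic scales to 7, and a spike of 400 requests stops at the cap of 10 instead of spending without limit.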
Path 2: Flash (no Docker, fastest to first request)
If you'd rather skip containers entirely:
pip install runpod-flash
Then decorate a Python function with the GPU and dependencies it needs, run your script, and Flash deploys it as a real serverless endpoint behind the scenes. Because the exact decorator syntax evolves, grab a copy-paste-ready starting point from the official examples — they cover everything from a hello-world GPU script to CPU workers, mixed-worker configs, and load-balanced endpoints with custom HTTP routes:
- Source: github.com/runpod/flash
- Examples: github.com/runpod/flash-examples
- Docs: docs.runpod.io/serverless/overview
Either way, you're on the same auto-scaling, per-second, scale-to-zero platform — start simple, and reach for multi-GPU workers, batch inference, or MIG as your needs grow.
Ready to deploy your first endpoint?
The fastest path in is Flash: no Docker, no container registry. Install the CLI, write your handler, and deploy. Your endpoint is live in minutes.
If you want full control over the container environment, the classic handler gives you that. Both approaches run on the same Runpod Serverless infrastructure, with the same autoscaling and per-second billing.
Frequently asked questions
What's the difference between FlashBoot and Flash?
FlashBoot is the cold-start optimization layer that gets pre-warmed workers responding in under 200ms. Flash is the open-source Python SDK that lets you deploy code to Serverless without building a Docker image. They're different tools with similar names: FlashBoot makes endpoints fast, and Flash makes them fast to ship.
How does Runpod handle cold starts?
Through a combination of FlashBoot (paging models between VRAM, system RAM, and NVMe), model caching in network storage so weights aren't re-downloaded, and a forthcoming container registry that keeps your containers cached closer to your workers.
When should I use batch inference instead of real-time?
Use real-time when a user is waiting on the response. Use batch when you have a large volume of work and a few hours of headroom. Runpod schedules it against available capacity and bills at the rate you agreed to, with no worker management on your end.
Can I run large open-weight models?
Yes. You can put up to 8 GPUs behind a single worker using H100, H200, B200, or B300 cards, which is enough to serve frontier-scale open models like Kimi K2 or MiniMax 2 on a single physical system.
Do I have to manage scaling?
No. Endpoints scale from zero to as many workers as your traffic needs and back down automatically. You set the min/max bounds; Runpod handles the rest.