Announcing Runpod Flash

Brendan McKeag
April 30, 2026


When we launched Flash in beta back in March, we made one wager: that the most painful part of serverless GPU development was Docker, and that Python developers would get a lot done if we took it off their plates. The response was clear, and the feedback was loud, specific, and incredibly useful. Today, we are shipping Flash GA — and it's an even better tool than the one we put out a couple months ago.

The pitch hasn't changed: you write a Python function, decorate it, and run it. That's it.

import asyncio
from runpod_flash import Endpoint, GpuType

@Endpoint(name="hello-gpu", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, dependencies=["torch"])
async def hello():
    import torch
    gpu_name = torch.cuda.get_device_name(0)
    return {"gpu": gpu_name}

print(asyncio.run(hello()))

That's the full program. Flash provisions a GPU on Runpod Serverless, installs your dependencies, executes the function on the remote hardware, and returns the result to your terminal. If you followed the beta, here's what's new in GA.

A cleaner, more honest API

The beta exposed Flash as @remote plus a separate LiveServerless config object. It worked, but it forced you to think about Runpod's resource model before you'd written a single line of business logic. In GA, the decorator is @Endpoint, and configuration lives directly on it:

@Endpoint(
    name="image-processor",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    workers=(0, 5),
    dependencies=["torch", "pillow"]
)
async def process_image(image_data):
    ...

GPUs are now selectable through a typed enum (GpuType.NVIDIA_A100_80GB_PCIe, GpuType.NVIDIA_GEFORCE_RTX_4090, and so on), or through groups when you want flexibility (GpuGroup.ANY, GpuGroup.AMPERE_80). Pass a list and Flash will use any of them — extremely useful for dodging capacity constraints on a specific SKU.

CPU endpoints work the same way with cpu="cpu5c-4-8" or CpuInstanceType.CPU5C_4_8.
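As a minimal sketch of both options (the endpoint names here are illustrative, and it assumes gpu= accepts a plain Python list of GpuType values, per the description above):

from runpod_flash import CpuInstanceType, Endpoint, GpuType

# Any of the listed GPUs may be used, whichever has capacity.
@Endpoint(
    name="flexible-worker",
    gpu=[GpuType.NVIDIA_GEFORCE_RTX_4090, GpuType.NVIDIA_A100_80GB_PCIe],
    dependencies=["torch"],
)
async def flexible(data):
    ...

# The CPU equivalent, using the typed instance enum instead of the string form.
@Endpoint(name="cpu-worker", cpu=CpuInstanceType.CPU5C_4_8)
def cpu_task(data):
    ...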

Four endpoint patterns, one class

The Endpoint class now covers every shape of serverless workload you'd actually want to build:

1. Queue-based (@Endpoint(...) as a decorator) for batch jobs and async work; the original Flash pattern.

2. Load-balanced (Endpoint(...) as an instance with route methods) for low-latency HTTP APIs. Multiple routes share the same workers, with no queue overhead:

api = Endpoint(name="inference-api", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, workers=(1, 5))

@api.post("/predict")
async def predict(data): ...

@api.get("/health")
async def health(): ...

3. Custom Docker images for cases where you've already got a worker built — vLLM, TensorRT-LLM, ComfyUI, your own image. Pass image= instead of dependencies= and you can hit it like an HTTP service:

vllm = Endpoint(
    name="vllm-server",
    image="runpod/worker-vllm:stable-cuda12.1.0",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    env={"MODEL_NAME": "microsoft/Phi-3.5-mini-instruct"}
)

result = await vllm.post("/v1/completions", {"prompt": "Hello"})

4. Existing endpoints by ID, so you can hit anything you've already deployed on Runpod from a Flash script:

ep = Endpoint(id="abc123")
job = await ep.run({"prompt": "hello"})
await job.wait()
print(job.output)

That last pattern means Flash isn't only a deployment tool. It's also the most ergonomic Python client we've ever shipped for Runpod Serverless.

flash build and flash deploy: from script to production

The biggest change in GA is that Flash now produces real, production deployments, not just live-test endpoints.

flash deploy scans your project for @Endpoint-decorated functions, groups them by configuration, resolves your Python dependencies, and bundles everything into a deployable artifact. Dependencies ship inside that artifact and are mounted at runtime rather than baked into the worker image, which cuts cold starts substantially.

The build is cross-platform out of the box. You can develop on an M-series Mac and Flash will still produce a Linux x86_64 artifact that runs cleanly on Runpod's serverless fleet. It enforces binary wheels and matches your Python version automatically, so there are no "works on my machine" surprises at deploy time.

There's also a 500MB deployment limit to be aware of. If you're using a GPU base image, PyTorch is pre-installed, and you can shave size off your bundle with:

flash deploy --exclude torch,torchvision,torchaudio

Cross-endpoint function calls

Inside a Flash app, functions on different endpoints can call each other directly. Flash's runtime uses the build manifest to handle service discovery and routing without any extra wiring:

@Endpoint(cpu="cpu5c-4-8")
def preprocess(data):
    return clean_data

@Endpoint(gpu=GpuType.NVIDIA_A100_80GB_PCIe)
async def inference(data):
    clean = preprocess(data)   # CPU endpoint
    return run_model(clean)    # this endpoint

This makes hybrid CPU/GPU pipelines — preprocess on cheap CPU workers, then run inference on a big GPU — almost trivially easy. Each function lives where it makes sense, and Flash glues the calls together.

Better local development

A few smaller changes that add up to a much nicer dev loop:

flash login opens your browser, authorizes once, and saves credentials securely. You can still use RUNPOD_API_KEY if you prefer.

flash dev --auto-provision spins up every endpoint in your project upfront so the first request to each one isn't a cold start. Endpoints get cached and reused across server restarts, identified by name and config.

flash undeploy finally exists. List your Flash endpoints, remove a single one, or wipe them all:

flash undeploy list
flash undeploy my-endpoint
flash undeploy --all

Coding agent integration. If you're using Claude Code, Cursor, or Cline, npx skills add runpod/skills installs a Flash skill package that gives your agent detailed context about the SDK. The agent stops hallucinating decorator syntax and starts writing code that actually runs.

Persistent storage, environment variables, and other production niceties

NetworkVolume is a first-class object now, with proper multi-datacenter support:

from runpod_flash import Endpoint, GpuGroup, DataCenter, NetworkVolume

volumes = [
    NetworkVolume(name="models-us", size=100, datacenter=DataCenter.US_GA_2),
    NetworkVolume(name="models-eu", size=100, datacenter=DataCenter.EU_RO_1),
]

@Endpoint(
    name="global-server",
    gpu=GpuGroup.ANY,
    datacenter=[DataCenter.US_GA_2, DataCenter.EU_RO_1],
    volume=volumes
)
async def serve(data):
    # files at /runpod-volume/
    ...

Files mount at /runpod-volume/, which is perfect for caching a model once and reusing it across cold starts.
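As a sketch of that caching pattern (the download_model helper, cache layout, and model ID handling are illustrative, not part of the SDK), inside a worker you might do:

import os

CACHE_DIR = "/runpod-volume/models"

def cached_weights(model_id: str) -> str:
    # Store each model under a stable path on the network volume.
    path = os.path.join(CACHE_DIR, model_id.replace("/", "--"))
    if not os.path.exists(path):
        # First request in this datacenter: download once, reuse on every later cold start.
        download_model(model_id, path)  # hypothetical download helper
    return path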

Environment variables passed via env= are now excluded from the configuration hash. Rotating an API key or flipping a feature flag won't trigger an endpoint rebuild — only structural changes (GPU type, image, container disk, etc.) do.

system_dependencies lets you install apt packages alongside your pip dependencies for libraries like OpenCV that need system-level support.
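For example, a sketch of an OpenCV endpoint, assuming system_dependencies takes a list of apt package names as described above (the specific packages shown are illustrative):

@Endpoint(
    name="opencv-worker",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    dependencies=["opencv-python", "numpy"],
    system_dependencies=["libgl1", "libglib2.0-0"],  # system libraries OpenCV typically needs
)
def detect_edges(image_bytes):
    import cv2
    import numpy as np
    # Decode the raw bytes into a grayscale image, then return Canny edges as a nested list.
    img = cv2.imdecode(np.frombuffer(image_bytes, dtype=np.uint8), cv2.IMREAD_GRAYSCALE)
    return cv2.Canny(img, 100, 200).tolist()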

GPU endpoints can deploy across multiple datacenters. CPU endpoints are still EU-RO-1 only for now; that'll expand soon.

The EndpointJob API

When you submit work to a queue-based endpoint or a custom-image endpoint, you get an EndpointJob back with everything you'd want for managing async work:

job = await endpoint.run({"input": {"prompt": "hello"}})

await job.status()          # IN_PROGRESS, COMPLETED, FAILED
await job.wait(timeout=60)  # block until done
job.output                  # result payload
job.error                   # error if failed
job.done                    # boolean
await job.cancel()          # cancel a running job

This is the same set of operations you'd build by hand against the Runpod Serverless API — just exposed as a clean Python object.

Get started

Flash is open source, on PyPI, and ready to use today.

pip install runpod-flash
flash login

The examples repository covers everything from hello-world GPU scripts through multi-stage CPU/GPU pipelines, vLLM deployments, custom HTTP APIs, and full Flash apps with persistent storage. If you tried Flash in beta and bounced off something, give it another look — there's a real chance the thing that frustrated you is fixed.

We're excited to see what you build.
