March 6th, 2026

Introducing Flash: Run GPU workloads on Runpod Serverless, no Docker required

Brendan McKeag


One of the most common pieces of feedback we hear from developers building on Serverless is that the experience isn't as frictionless as working in a pod. Between writing Dockerfiles, building images, pushing to a registry, and wiring everything up to an endpoint, the overhead can turn a quick experiment into a multi-hour chore. Today, we're changing that with Flash: an application framework that lets you deploy GPU-accelerated Python functions to Runpod Serverless with a single decorator.

No Docker. No container orchestration. Just Python.

What is Flash?

Flash is a Python SDK for distributed inference and orchestration on Runpod's serverless infrastructure. You write functions locally, decorate them with @endpoint, and Flash takes care of everything else: provisioning the serverless endpoint, selecting the GPU hardware you specify, installing your dependencies, and returning the results. Docker still runs under the hood to execute your code, but you never need to touch it.

The only things you need to get started are a Runpod account with a balance, an API key, and the runpod-flash package installed on your local machine.

pip install runpod-flash
export RUNPOD_API_KEY=[YOUR_API_KEY]
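Since Flash reads the API key from your environment, it can be handy to fail fast with a clear message before kicking off a run. Here's a small preflight check; this is just a local convenience, not part of the Flash SDK:

```python
import os

def check_api_key(env=os.environ):
    """Raise early with a clear message if RUNPOD_API_KEY is missing."""
    key = env.get("RUNPOD_API_KEY")
    if not key:
        raise RuntimeError(
            "RUNPOD_API_KEY is not set; export it before running Flash."
        )
    return key
```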

How It Works

The workflow is straightforward. You define a Python function with the @endpoint decorator, specifying the GPU type, worker count, and any pip dependencies your code needs. When you run the script, Flash quietly creates and manages a serverless endpoint behind the scenes; you'll even see it appear in the Runpod console if you want to inspect it. Your function executes on the remote hardware, and the results come back to your terminal.

Here's a minimal example that runs a matrix multiplication on a remote GPU:

from runpod_flash import endpoint
import torch

@endpoint(
    name="matrix-multiply",
    gpu="RTX4090",
    workers=[1, 3],
    dependencies=["torch"]
)
async def multiply(request):
    a = torch.randn(1000, 1000, device="cuda")
    b = torch.randn(1000, 1000, device="cuda")
    result = torch.matmul(a, b)
    return {"shape": list(result.shape), "sample": result[0][:5].tolist()}
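Before spending GPU time, you can sanity-check the handler's logic locally. Here's a CPU-only stand-in for the same computation using NumPy; it's purely illustrative and involves neither Flash nor CUDA:

```python
import numpy as np

def multiply_local(size=1000):
    # Same math as the remote handler, run on CPU with NumPy.
    rng = np.random.default_rng(0)
    a = rng.standard_normal((size, size))
    b = rng.standard_normal((size, size))
    result = a @ b
    return {"shape": list(result.shape), "sample": result[0][:5].tolist()}
```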

Build Production APIs with Flash and FastAPI

Flash isn't just for standalone scripts. You can pair it with FastAPI to build full production APIs in under 50 lines of code. The pattern is clean: define your Flash-decorated worker functions in one file, mount a FastAPI router in another, and use flash run to serve the whole thing.

# endpoint.py
from runpod_flash import endpoint
import torch
import numpy as np

@endpoint(
    name="matrix-service",
    gpu=["ADA_24", "AMPERE_48"],
    workers=[1, 3],
    dependencies=["torch", "numpy"]
)
async def compute(request):
    size = request.get("size", 1024)

    # Generate random matrices and multiply on GPU
    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    result = torch.matmul(a, b)

    return {
        "gpu": torch.cuda.get_device_name(0),
        "matrix_size": f"{size}x{size}",
        "result_sample": result[0][:5].cpu().tolist(),
        "result_mean": result.mean().cpu().item(),
    }

# main.py
from fastapi import FastAPI
from endpoint import router

app = FastAPI()
app.include_router(router)

Then just run:

flash run

In a second terminal, send a request:

curl -X POST http://localhost:8888/compute \
  -H "Content-Type: application/json" \
  -d '{"size": 2048}'

And you'll get a JSON response back from the GPU:

{
  "gpu": "NVIDIA GeForce RTX 4090",
  "matrix_size": "2048x2048",
  "result_sample": [12.34, -8.91, 45.67, -23.12, 7.56],
  "result_mean": 0.0023
}
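If you'd rather stay in Python than shell out to curl, the same request can be built with only the standard library. This assumes flash run is still serving on localhost:8888, as above:

```python
import json
import urllib.request

def build_compute_request(size):
    # Mirror of the curl call above: POST JSON to the local Flash server.
    payload = json.dumps({"size": size}).encode("utf-8")
    return urllib.request.Request(
        "http://localhost:8888/compute",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_compute_request(2048)
# To actually send it while `flash run` is active:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```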

That's a full production API — decorator, router, and live GPU inference — in under 50 lines.

Why Flash Matters

We built Flash because we believe the barrier to entry for serverless GPU computing should be as low as possible. Docker is powerful, but for many developers, especially those prototyping, experimenting, or iterating quickly, it adds friction that slows down the creative loop.

Flash changes the economics of development too. Instead of keeping a pod running while you code and test, you can send requests to a serverless endpoint only when you're ready. You pay for compute when you use it, not while you're thinking.

And because Flash endpoints are real Runpod serverless endpoints, you get all the production benefits: autoscaling, cold start management, and GPU availability across our full fleet.

Get Started

Flash is open source and ready to use today.

The examples repository includes walkthroughs covering everything from hello-world GPU scripts to CPU workers, mixed worker configurations, dependency management, and load-balanced endpoints with custom HTTP routes.

We're excited to make serverless development feel as natural as writing local Python. Give Flash a try, and let us know what you build.
