One of the most common pieces of feedback we hear from developers building on Serverless is that the experience isn't as frictionless as working in a pod. Between writing Dockerfiles, building images, pushing to a registry, and wiring everything up to an endpoint, the overhead can turn a quick experiment into a multi-hour chore. Today, we're changing that with Flash: an application framework that lets you deploy GPU-accelerated Python functions to Runpod Serverless with a single decorator.
No Docker. No container orchestration. Just Python.
Flash is a Python SDK for distributed inference and orchestration on Runpod's serverless infrastructure. You write functions locally, decorate them with @endpoint, and Flash takes care of everything else: provisioning the serverless endpoint, selecting the GPU hardware you specify, installing your dependencies, and returning the results. Docker still runs under the hood to execute your code, but you never need to touch it.
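As a rough mental model (a hypothetical sketch, not Flash's actual implementation), a decorator-based framework like this captures your deployment config at definition time and wraps the function for remote dispatch:

```python
import functools

# Hypothetical sketch of the decorator pattern: capture deployment config
# and wrap the function. A real framework like Flash would serialize the
# request, provision the endpoint described by `config`, and execute the
# function remotely; here we run it locally just to show the shape.
def endpoint_sketch(name, gpu, workers, dependencies):
    def decorate(fn):
        config = {
            "name": name,
            "gpu": gpu,
            "workers": workers,
            "dependencies": dependencies,
        }

        @functools.wraps(fn)
        def wrapper(request):
            # Stand-in for remote execution: call the function directly.
            return fn(request)

        wrapper.config = config  # config travels with the function
        return wrapper
    return decorate

@endpoint_sketch(name="demo", gpu="RTX4090", workers=[1, 3], dependencies=[])
def double(request):
    return {"result": request["x"] * 2}
```

The point of the pattern is that the function body stays plain Python; everything about where and how it runs lives in the decorator arguments.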
The only things you need to get started are a RunPod account with a balance, an API key, and the runpod-flash package installed on your local machine.
pip install runpod-flash
export RUNPOD_API_KEY=[YOUR_API_KEY]

The workflow is straightforward. You define a Python function with the @endpoint decorator, specifying the GPU type, worker count, and any pip dependencies your code needs. When you run the script, Flash silently creates and manages a serverless endpoint behind the scenes — you'll even see it appear in the RunPod console if you want to inspect it. Your function executes on the remote hardware and the results come back to your terminal.
Here's a minimal example that runs a matrix multiplication on a remote GPU:
from runpod_flash import endpoint
import torch

@endpoint(
    name="matrix-multiply",
    gpu="RTX4090",
    workers=[1, 3],
    dependencies=["torch"]
)
async def multiply(request):
    a = torch.randn(1000, 1000, device="cuda")
    b = torch.randn(1000, 1000, device="cuda")
    result = torch.matmul(a, b)
    return {"shape": list(result.shape), "sample": result[0][:5].tolist()}

Flash isn't just for standalone scripts. You can pair it with FastAPI to build full production APIs in under 50 lines of code. The pattern is clean: define your Flash-decorated worker functions in one file, mount a FastAPI router in another, and use flash run to serve the whole thing.
# endpoint.py
from runpod_flash import endpoint
import torch
import numpy as np

@endpoint(
    name="matrix-service",
    gpu=["ADA_24", "AMPERE_48"],
    workers=[1, 3],
    dependencies=["torch", "numpy"]
)
async def compute(request):
    size = request.get("size", 1024)
    # Generate random matrices and multiply on GPU
    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    result = torch.matmul(a, b)
    return {
        "gpu": torch.cuda.get_device_name(0),
        "matrix_size": f"{size}x{size}",
        "result_sample": result[0][:5].cpu().tolist(),
        "result_mean": result.mean().cpu().item(),
    }

# main.py
from fastapi import FastAPI
from endpoint import router

app = FastAPI()
app.include_router(router)

Then just run:
flash run

In a second terminal, send a request:
curl -X POST http://localhost:8888/compute \
-H "Content-Type: application/json" \
-d '{"size": 2048}'

And you'll get a JSON response back from the GPU:
{
  "gpu": "NVIDIA GeForce RTX 4090",
  "matrix_size": "2048x2048",
  "result_sample": [12.34, -8.91, 45.67, -23.12, 7.56],
  "result_mean": 0.0023
}

That's a full production API — decorator, router, and live GPU inference — in under 50 lines.
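If you'd rather call the service from Python than curl, the same request can be sent with the standard library (a minimal sketch; the URL and port assume the default local flash run setup shown above):

```python
import json
import urllib.request

# Assumes flash run is serving locally on port 8888, as in the example above.
FLASH_URL = "http://localhost:8888/compute"

def build_request(size: int) -> urllib.request.Request:
    """Build the same POST the curl example sends."""
    body = json.dumps({"size": size}).encode()
    return urllib.request.Request(
        FLASH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query_compute(size: int = 2048) -> dict:
    # Requires the flash run server from the example to be running locally.
    with urllib.request.urlopen(build_request(size), timeout=60) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(query_compute(2048))
```

This is a one-for-one translation of the curl call, so the JSON response has the same shape as the output shown above.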
We built Flash because we believe the barrier to entry for serverless GPU computing should be as low as possible. Docker is powerful, but for many developers, especially those prototyping, experimenting, or iterating quickly, it adds friction that slows down the creative loop.
Flash changes the economics of development too. Instead of keeping a pod running while you code and test, you can send requests to a serverless endpoint only when you're ready. You pay for compute when you use it, not while you're thinking.
And because Flash endpoints are real Runpod serverless endpoints, you get all the production benefits: autoscaling, cold start management, and GPU availability across our full fleet.
Flash is open source and ready to use today.
pip install runpod-flash

The examples repository includes walkthroughs covering everything from hello-world GPU scripts to CPU workers, mixed worker configurations, dependency management, and load-balanced endpoints with custom HTTP routes.
We're excited to make serverless development feel as natural as writing local Python. Give Flash a try, and let us know what you build.
