Run machine learning
inference at scale.
Only pay for what you use — no idle costs, just unparalleled speed and scalability.
import runpod

def handler(job):
    job_input = job["input"]
    return "Running on Runpod!"

runpod.serverless.start({"handler": handler})
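Before deploying, the handler above can be sanity-checked locally by calling it with a job-shaped dict; no GPU or deployment is required. The "prompt" field in the payload below is a made-up example, not part of RunPod's schema:

```python
# Local sanity check for a RunPod-style handler (no deployment needed).
# The {"input": {...}} payload shape matches what runpod.serverless
# passes to the handler; the "prompt" key is just an illustrative field.
def handler(job):
    job_input = job["input"]
    return "Running on Runpod!"

result = handler({"input": {"prompt": "hello"}})
print(result)  # Running on Runpod!
```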
Spend more time training your models.
Let us handle your inference.
For your expected load, provision active workers running 24/7 with a 30% discount & flex workers to handle any sudden traffic.
Try it now
Autoscale in seconds
Respond to user demand in real time with GPU workers that
scale from 0 to 100s in seconds.
[Chart: flex and active worker counts over a day: 10 GPUs at 6:24AM, 100 GPUs at 11:34AM, 20 GPUs at 1:34PM]
Active Workers
-30% DISCOUNT
Dedicated GPUs that handle consistent workloads 24/7.
Get them at a lower cost so you don't break the bank for stable usage.
Flex Workers
Flexible GPUs that cost nothing when idle.
Ready to scale up as soon as your launch goes viral.
Zero Cold-Starts with Active Workers
No cold-start time — because the workers are always
running. Get instant execution when speed is all that matters.
<250ms Cold-Start with Flashboot
Flashboot is an optimization layer for our container system to manage deployments and scale up workers in real time.
Handle more consistent workloads like fine-tuning
Scale workers by Queue Delay or Request count
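As a rough illustration of request-count scaling (the policy and parameters here are assumptions for the sketch, not RunPod's actual server-side scaler), the idea is one worker per N queued requests, capped at your max worker count, and zero when idle:

```python
import math

def desired_workers(queued_requests, requests_per_worker=5, max_workers=100):
    """Illustrative request-count scaling: one worker per N queued
    requests, capped at max_workers, zero when the queue is empty."""
    if queued_requests == 0:
        return 0  # flex workers cost nothing when idle
    return min(max_workers, math.ceil(queued_requests / requests_per_worker))

print(desired_workers(0))    # 0
print(desired_workers(12))   # 3
print(desired_workers(999))  # 100
```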
Monitor your endpoint with
real-time analytics
Usage Analytics
Real-time usage analytics for your endpoint with metrics on completed and failed requests. Useful for endpoints that have fluctuating usage profiles throughout the day.
See the console
Active Requests
Completed: 2,277 | Retried: 21 | Failed: 9
Execution Time
Total: 1,420s | P70: 8s | P90: 19s | P98: 22s
Execution Time Analytics
Debug your endpoints with detailed metrics on execution time. Useful for hosting models that have varying execution times, like large language models. You can also monitor delay time, cold start time, cold start count, GPU utilization, and more.
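The P70/P90/P98 figures are latency percentiles. A minimal nearest-rank computation shows what they mean (the console's exact method may differ, and the sample execution times below are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the samples are less than or equal to it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

exec_times = [2, 3, 5, 7, 8, 8, 8, 19, 20, 22]  # made-up execution times (s)
print(percentile(exec_times, 70))  # 8
print(percentile(exec_times, 90))  # 20
print(percentile(exec_times, 98))  # 22
```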
See the console
Real-Time Logs
Get descriptive, real-time logs to show you exactly what's happening across your active and flex GPU workers at all times.
See the console
2024-03-15T19:56:00.8264895Z INFO | Started job db7c79
2024-03-15T19:56:03.2667597Z
0% | | 0/28 [00:00<?, ?it/s]
12% |██ | 4/28 [00:00<00:01, 12.06it/s]
38% |████ | 12/28 [00:00<00:01, 12.14it/s]
77% |████████ | 22/28 [00:01<00:00, 12.14it/s]
100% |██████████| 28/28 [00:02<00:00, 12.13it/s]
2024-03-15T19:56:04.7438407Z INFO | Completed job db7c79 in 2.9s
2024-03-15T19:57:00.8264895Z INFO | Started job ea1r14
2024-03-15T19:57:03.2667597Z
0% | | 0/28 [00:00<?, ?it/s]
15% |██ | 4/28 [00:00<00:01, 12.06it/s]
41% |████ | 12/28 [00:00<00:01, 12.14it/s]
80% |████████ | 22/28 [00:01<00:00, 12.14it/s]
100% |██████████| 28/28 [00:02<00:00, 12.13it/s]
2024-03-15T19:57:04.7438407Z INFO | Completed job ea1r14 in 2.9s
2024-03-15T19:58:00.8264895Z INFO | Started job gn3a25
2024-03-15T19:58:03.2667597Z
0% | | 0/28 [00:00<?, ?it/s]
18% |██ | 4/28 [00:00<00:01, 12.06it/s]
44% |████ | 12/28 [00:00<00:01, 12.14it/s]
83% |████████ | 22/28 [00:01<00:00, 12.14it/s]
100% |██████████| 28/28 [00:02<00:00, 12.13it/s]
2024-03-15T19:58:04.7438407Z INFO | Completed job gn3a25 in 2.9s
Cost effective for every inference workload
Save 15% over other Serverless cloud providers on flex workers alone.
Create active workers and configure queue delay for even more savings.
GPU Price Per Second
GPU (VRAM)         Flex       Active     Notes
A100 (80 GB)       $0.00076   $0.00060   High throughput GPU, yet still very cost-effective.
H100 PRO (80 GB)   $0.00155   $0.00124   Extreme throughput for big models.
A6000 (48 GB)      $0.00034   $0.00024   A cost-effective option for running big models.
L40 PRO (48 GB)    $0.00053   $0.00037   Extreme inference throughput on LLMs like Llama 3 7B.
A5000 (24 GB)      $0.00019   $0.00013   Great for small-to-medium sized inference workloads.
4090 PRO (24 GB)   $0.00031   $0.00021   Extreme throughput for small-to-medium models.
A4000 (16 GB)      $0.00016   $0.00011   The most cost-effective for small models.
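A back-of-envelope monthly estimate from these per-second rates can be sketched as follows. The billing split (warm active requests versus flex requests that also pay a cold start) is an illustrative assumption, not RunPod's official billing formula; use the pricing calculator for real numbers:

```python
def monthly_cost(requests, exec_s, flex_price, active_price,
                 active_fraction=0.5, cold_start_s=1.0):
    """Illustrative estimate only. Assumes active workers serve a fraction
    of requests with no cold start, while flex requests add cold_start_s
    of billed time. Not RunPod's official billing formula."""
    active_reqs = requests * active_fraction
    flex_reqs = requests - active_reqs
    return (active_reqs * exec_s * active_price
            + flex_reqs * (exec_s + cold_start_s) * flex_price)

# e.g. 72,000 one-second requests on an A4000 ($0.00016 flex, $0.00011 active)
print(round(monthly_cost(72_000, 1.0, 0.00016, 0.00011), 2))  # ≈ 15.48
```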
Thousands of GPUs across 9 Regions
Update your endpoint's region in two clicks. Scale up to 9 regions at a time. Global automated failover is supported out-of-the-box, so you won't have to worry about GPU errors interrupting your ML inference.
Pending Certifications
Although many of our data center partners already hold these compliance certifications, RunPod is in the process of obtaining SOC 2, ISO 27001, and HIPAA certifications. We aim to have all three by early Q3 2024.
North America
US-OR-1
CA-MTL-1
CA-MTL-2
European Union
EUR-IS-1
EUR-IS-2
EUR-NO-1
Europe
EU-NL-1
EU-RO-1
EU-SE-1
Serverless Pricing Calculator
Example estimate: 72,000 requests per month at 1 second each ≈ $42/mo
1. Cost estimation assumes 50% of the requests use the active price and incur a 1s cold start.
We're with you from seed to scale
Book a call with our sales team to learn more.
Gain Additional Savings with Reservations
Save more by committing to longer-term usage. Reserve discounted active and flex workers by speaking with our team.
Book a call
Are you an early-stage startup or ML researcher?
Get up to $25K in free compute credits with RunPod. These can be used towards on-demand GPUs and Serverless endpoints.
Apply
"It really shows that RunPod is made by developers. They know exactly what engineers really want and they ship those features in order of importance."
Hara Kang, CTO, LOVO AI
5,057,372,397
requests & 100k+ developers since launch
Get started with RunPod
today.
We handle millions of serverless requests a day. Scale your machine learning inference while keeping costs low.