Last week on Monday, AWS experienced a significant outage affecting multiple services in the us-east-1 region. The disruption impacted thousands of sites and millions of users across the internet, and Runpod was certainly not immune to these effects.

During the outage, Runpod console availability was impacted because our upstream provider, Vercel, depended on this region. Serverless endpoints continued to receive requests but couldn’t process them due to the impact on our worker management microservice. Some users experienced issues with Pod provisioning and access, while others encountered extended delay times throughout the platform. Our payment processing system was also impacted during this period, and we took steps after the fact to ensure customers were not charged for resources that they could not use.

Understanding our dependencies

Some customers questioned why Runpod was impacted by an AWS outage at all. While Runpod has over 40 data centers designed for AI application development and deployment, we leverage AWS infrastructure to host critical portions of our control plane. This architecture has enabled us to scale our web application effectively, but it also means that AWS availability directly impacts our platform's operational status.

That said, Runpod's GPU compute resources remain entirely independent of our control plane. Pod workloads remained operational during the AWS outage, and even when the Runpod UI was unavailable, your Pods, endpoints, and clusters remained intact and secure. Once connectivity was restored, these resources returned to their normal operational state without data loss or configuration changes. Similarly, for Serverless, as soon as the coordination microservice was back online, workers could resume processing requests as normal.

Immediate infrastructure improvements

Following the outage, our engineering team immediately began adding critical redundancy to our infrastructure. Within 72 hours of the incident, the team had deployed our core services across multiple AWS regions. If AWS suffers another outage like this one, our platform is prepared to fail over to a healthy region and stay online.
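To make the failover idea concrete, here is a minimal sketch of choosing a healthy region from an ordered preference list. It is an illustration only and not our production code: the function name `pick_healthy_region`, the health-check callback, and every region name other than us-east-1 are assumptions made for this example.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, that passes its health check.

    `is_healthy` is a hypothetical callback; a real check might probe an
    HTTP health endpoint with a timeout. Returns None if no region is healthy.
    """
    for region in regions:
        if is_healthy(region):
            return region
    return None


if __name__ == "__main__":
    # Illustrative preference order; only us-east-1 is named in this post.
    preferred = ["us-east-1", "us-west-2", "eu-west-1"]
    # Simulate the primary region being down.
    print(pick_healthy_region(preferred, lambda region: region != "us-east-1"))
```

Keeping the regions in a fixed preference order means traffic returns to the primary region as soon as its health check passes again.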

We also enhanced our Serverless platform's resilience to control plane disruptions. If necessary, workers can now use their cached configurations, allowing them to continue accepting and processing requests for an extended period, even if the central service is unavailable. When the connection is restored, the workers’ distributed state automatically synchronizes with the control plane. Distributing state in this way reduces the blast radius if AWS or any other core internet service suffers another outage.
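The cached-configuration behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our actual worker implementation: the class `CachedConfigWorker`, the `ControlPlaneUnavailable` error, the `fetch_config` and `push_state` callables, and the one-hour cache limit are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class ControlPlaneUnavailable(Exception):
    """Hypothetical error raised when the control plane cannot be reached."""


@dataclass
class CachedConfigWorker:
    # Both callables are assumptions standing in for control plane calls.
    fetch_config: Callable[[], dict]
    push_state: Callable[[List[str]], None]
    max_cache_age_s: float = 3600.0  # illustrative limit, not a real Runpod value
    clock: Callable[[], float] = time.monotonic
    cached_config: Optional[dict] = None
    cached_at: float = 0.0
    unsynced: List[str] = field(default_factory=list)

    def get_config(self) -> dict:
        """Prefer fresh configuration; fall back to the cache during an outage."""
        try:
            self.cached_config = self.fetch_config()
            self.cached_at = self.clock()
        except ControlPlaneUnavailable:
            too_old = self.clock() - self.cached_at > self.max_cache_age_s
            if self.cached_config is None or too_old:
                raise  # nothing usable is cached, so the failure surfaces
        return self.cached_config

    def handle(self, request: str) -> str:
        """Process a request and record the result locally for a later sync."""
        config = self.get_config()
        result = f"{config['endpoint']}:{request}"
        self.unsynced.append(result)
        return result

    def sync(self) -> int:
        """Push locally held state to the control plane; return how many records were sent."""
        try:
            self.push_state(list(self.unsynced))
        except ControlPlaneUnavailable:
            return 0  # keep the records and retry later
        count = len(self.unsynced)
        self.unsynced.clear()
        return count
```

In this sketch the worker keeps serving requests from its last known configuration while the control plane is down, and `sync` drains the locally recorded state once the connection returns. The cache age limit bounds how long a worker may run on stale configuration.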

Our roadmap to resilience

This outage was a painful lesson for us, but a valuable one. While we’re proud of our core design, which separates your compute resources from the control plane to keep your workloads safe, we are far from satisfied. Our long-term roadmap includes transitioning to a partitioned multi-region deployment hosted entirely on Runpod's own provider network, with automated load balancing and failover capabilities.

While we cannot prevent external infrastructure failures, we are committed to building a platform that remains resilient throughout such events. We appreciate your patience during last week’s disruption and we’ll continue providing updates on our progress.

