Reduce Your Serverless Automatic1111 Start Time

If you're using the Automatic1111 Stable Diffusion repo as an API layer, startup speed matters. This post explains two key Docker-level optimizations.

I've found that many users are using the Automatic1111 Stable Diffusion repo not only as a GUI, but also as an API layer. If you're trying to scale a service on top of A1111, shaving a few seconds off your start time can be really important. If you need your Automatic1111 install to start faster, this is the article for you!

We will be referencing the files found in this repository for this blog post: https://github.com/runpod/containers/tree/main/serverless-automatic

There are two major performance optimizations that we will cover in this blog post:

1) Make sure that the needed Hugging Face files are cached

2) Pre-calculate the model hash

Both of these optimizations are taken care of by the Dockerfile line that runs the cache.py script; see the Dockerfile in the repository linked above.

The cache.py script simply imports and runs a few functions from Automatic1111's webui and modules code; the script itself is in the same repository.


If you run this against an installation of Automatic1111 from the command line, you will find that it does two major things:

1) It will download some files and store them in the Hugging Face cache (/root/.cache/huggingface)

If you don't do this prior to launching your serverless template, it will have to download these files on every cold start! Yikes!

2) It will calculate the model hash and store it in /workspace/stable-diffusion-webui/cache.json. Automatic1111 does this by default on launch, so doing it at build time moves that cost out of startup. Alternatively, you can disable hashing altogether with the --no-hashing command-line argument.
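To see why a precomputed hash saves time, here is a minimal sketch of hash caching in general. It is not Automatic1111's actual implementation, and the JSON layout is a hypothetical one rather than the real cache.json schema. The function names `sha256_of_file` and `cached_model_hash` are made up for illustration.

```python
import hashlib
import json
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large checkpoint never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_model_hash(model_path, cache_path):
    """Return the model hash, reusing a stored value when the file is unchanged.

    The cache layout is hypothetical, not Automatic1111's real cache.json format.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)

    mtime = os.path.getmtime(model_path)
    entry = cache.get(model_path)
    if entry and entry.get("mtime") == mtime:
        return entry["sha256"]  # cache hit: the checkpoint is never read

    sha256 = sha256_of_file(model_path)  # cache miss: the slow path
    cache[model_path] = {"mtime": mtime, "sha256": sha256}
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)
    return sha256
```

The slow path reads every byte of a multi-gigabyte checkpoint. If the cache file is already baked into the image, a cold start only ever takes the fast path.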

Here's the comparison before and after:

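Before

[Figure: startup timings before the optimizations]

After

[Figure: startup timings after the optimizations]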

We have found that the startup time for Automatic1111 is heavily CPU-bound, which means that a faster CPU will yield a faster startup time. In our experience, startup time scales linearly with single-core CPU performance.
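One rough way to check whether a piece of work is CPU-bound is to compare the CPU time it consumes with the wall-clock time it takes. The sketch below is only an illustration of that idea, not the method used for the measurements in this post. `cpu_fraction`, `busy_work`, and `waiting_work` are hypothetical names, and the two workloads are stand-ins rather than real startup code.

```python
import time


def cpu_fraction(func):
    """Run func once and return CPU time divided by wall-clock time.

    A value near 1.0 suggests CPU-bound work; a value near 0.0 suggests
    the time was spent waiting (disk, network, sleep).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def busy_work():
    # Stand-in for CPU-heavy startup work such as imports and hashing.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def waiting_work():
    # Stand-in for time spent waiting on I/O.
    time.sleep(0.2)


if __name__ == "__main__":
    print(f"busy_work:    {cpu_fraction(busy_work):.2f}")
    print(f"waiting_work: {cpu_fraction(waiting_work):.2f}")
```

Note that `time.process_time()` only counts CPU time used by the current process, so work done in child processes is not included.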

If you look closely, you will see that a relatively long time is still spent importing both the PyTorch and Gradio modules. The next blog post will look at ways to reduce these import times. Stay tuned!
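If you want to measure these import times yourself, a small timing helper is enough to get a first number. This is a sketch with hypothetical helper names (`time_import`, `report_import_times`), and it only gives meaningful results in a fresh interpreter, because a module that has already been imported returns almost instantly.

```python
import importlib
import time


def time_import(module_name):
    """Import a module by name and return how many seconds the import took."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


def report_import_times(module_names):
    """Time each import in order and return a {module_name: seconds} mapping."""
    return {name: time_import(name) for name in module_names}
```

For example, calling `report_import_times(["torch", "gradio"])` inside the container would time the two imports mentioned above, assuming both packages are installed. For a per-module breakdown, CPython also offers the `-X importtime` interpreter option.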

Author profile: Pardeep Singh
