
Guides
June 6, 2025

How to Run Ollama, Whisper, and ComfyUI Together in One Container

Emmett Fear
Solutions Engineer

The world of AI tooling is rapidly evolving, and developers are increasingly seeking ways to run multiple models and interfaces in streamlined environments. Whether you're experimenting with multimodal AI pipelines or building a real-world app powered by language, voice, and image capabilities, running Ollama (for language models), Whisper (for speech-to-text), and ComfyUI (for visual workflows) together in a single container can significantly accelerate development and testing.

This article will guide you through setting up all three tools in a single GPU-enabled container using RunPod, a powerful cloud computing platform that supports custom containers, template deployments, and API access. You'll learn how to structure the environment, manage GPU resources efficiently, and even make use of RunPod's GPU templates for a faster launch.

Why Combine Ollama, Whisper, and ComfyUI?

Running these three tools together in one container isn’t just a convenience—it’s a practical solution for building:

  • End-to-end AI pipelines (e.g., convert audio to text with Whisper → summarize with Ollama → generate image with ComfyUI).
  • Interactive demos for AI startups.
  • Local inference APIs for real-time media processing.
  • Multimodal workflows without switching environments or processes.

Let’s take a quick look at what each tool does:

  • Ollama: A user-friendly interface for running large language models like LLaMA, Mistral, and Phi on your local machine or container.
  • Whisper: OpenAI's powerful automatic speech recognition (ASR) model for real-time or batch audio-to-text conversion.
  • ComfyUI: A modular, node-based UI for executing complex Stable Diffusion pipelines with full visual control.

Combining these in one container means your app can "hear", "think", and "see"—all at once.

What You’ll Need Before You Start

  • A RunPod account (sign up is free).
  • Familiarity with Docker and basic shell commands.
  • A working knowledge of Python-based environments.
  • Access to the RunPod Container Launch Guide.
  • RunPod pricing info to choose the right GPU instance.

Step 1: Choose the Right GPU Instance on RunPod

When launching a container with multiple heavy AI models, GPU power matters. Based on testing, here are the recommended minimums:

  • 1x NVIDIA A10G or higher
  • 16GB VRAM
  • 60GB+ Disk Space

You can explore all available GPU options on RunPod’s pricing page. Note that availability may vary depending on your region.

Step 2: Build Your Custom Dockerfile

You'll need to write a custom Dockerfile that installs all dependencies for Ollama, Whisper, and ComfyUI. Below is a minimal example:

Dockerfile
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04

# Install system packages
RUN apt-get update && apt-get install -y \
    git python3 python3-pip wget ffmpeg curl libgl1 unzip && \
    apt-get clean

# Install Python packages
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
RUN pip3 install openai-whisper

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh

# Install ComfyUI
RUN git clone https://github.com/comfyanonymous/ComfyUI.git /workspace/ComfyUI
WORKDIR /workspace/ComfyUI
RUN pip3 install -r requirements.txt

# Expose default ports
EXPOSE 11434 8188

# Start all services
CMD ["bash", "-c", "ollama serve & python3 /workspace/ComfyUI/main.py --listen 0.0.0.0 --port 8188"]

Best practice: Consider using a process manager like supervisord to manage multiple services gracefully.
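If you adopt supervisord, a minimal configuration could look like the sketch below, written as shell commands you could adapt into a RUN instruction. It assumes the supervisor package is installed in the image (for example with apt-get install -y supervisor); the file path and program names are illustrative.

bash
# Sketch: write a supervisord config that keeps both services running.
# Assumes supervisor is installed in the image (e.g., apt-get install -y supervisor).
cat > /etc/supervisor/conf.d/ai-stack.conf <<'EOF'
[program:ollama]
command=ollama serve
autorestart=true

[program:comfyui]
command=python3 /workspace/ComfyUI/main.py --listen 0.0.0.0 --port 8188
directory=/workspace/ComfyUI
autorestart=true
EOF

# Then swap the Dockerfile CMD for something like:
# CMD ["supervisord", "-n", "-c", "/etc/supervisor/supervisord.conf"]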

Tip: Check out RunPod's Dockerfile best practices to avoid common build errors.

Step 3: Launch Your Container on RunPod

Once your Dockerfile is ready:

  1. Go to RunPod GPU Templates.

  2. Choose "Custom Image".

  3. Upload your Dockerfile or use a GitHub repo link.

  4. Set your container to expose ports 11434 (Ollama) and 8188 (ComfyUI).

  5. Select a compatible GPU and storage size.

  6. Launch the container.

Follow the container launch guide for a complete walkthrough.

Step 4: Pull Models in the Container

Once the container is running:

For Ollama:

bash
ollama pull mistral
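Once the pull completes, you can sanity-check the model from inside the container with a quick call to Ollama's HTTP API (the prompt is just an example):

bash
# Ask the local Ollama server for a short, non-streaming completion.
curl http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Say hello in one sentence.", "stream": false}'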

For Whisper (already installed in the image):
You can call it directly from the command line:

bash
whisper input.mp3 --model medium

For ComfyUI:
Make sure your ComfyUI container has Stable Diffusion weights. You can download them into:

bash
/workspace/ComfyUI/models/checkpoints/

Refer to ComfyUI GitHub for exact model requirements.
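For example, once you have a download URL for a checkpoint you are licensed to use (the URL and filename below are placeholders), you can pull it straight into that folder:

bash
# Download a Stable Diffusion checkpoint into ComfyUI's checkpoints folder.
# Replace <model-download-url> and the filename with your chosen model.
wget -O /workspace/ComfyUI/models/checkpoints/sd-model.safetensors "<model-download-url>"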

Step 5: Access Your Services

After deployment:

  • Visit https://<your-runpod-endpoint>:11434 to interact with the Ollama API.

  • Visit https://<your-runpod-endpoint>:8188 for the ComfyUI visual interface.

  • Use SSH or terminal to run Whisper commands directly or set up a web service (like Flask or FastAPI) to expose Whisper functionality.

Optional: Add API Integration

You can expose each tool’s functionality via your own API. For example, wrap Whisper in a FastAPI app:

python
from fastapi import FastAPI, UploadFile
import whisper

app = FastAPI()
model = whisper.load_model("medium")

@app.post("/transcribe/")
async def transcribe(file: UploadFile):
    # Save the uploaded audio to a temporary file so Whisper can read it from disk.
    audio = await file.read()
    with open("temp.wav", "wb") as f:
        f.write(audio)
    result = model.transcribe("temp.wav")
    return {"text": result["text"]}

You can deploy this script inside your container and expose port 8000 for API access.
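Assuming you save the script as app.py (the filename is just an example), you can serve it with uvicorn and test it with curl. Note that FastAPI needs python-multipart installed to accept file uploads:

bash
# Install the API dependencies and start the server.
pip3 install fastapi uvicorn python-multipart
uvicorn app:app --host 0.0.0.0 --port 8000

# From another shell: send an audio file to the new endpoint.
curl -X POST -F "file=@sample.wav" http://localhost:8000/transcribe/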

Pro Tip: Explore RunPod's API Docs to automate inference calls and scaling.

Monitoring and Managing Resources

When running three services in one container, keep an eye on:

  • VRAM usage: Use nvidia-smi to track GPU memory usage (see the commands after this list).
  • CPU load: Whisper can be CPU-heavy for long audio.
  • Disk usage: SD models used by ComfyUI can take up 3GB+ per checkpoint.
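A few standard commands cover these checks from a terminal inside the container (nvidia-smi ships with the NVIDIA driver; watch and du are standard Linux tools):

bash
# Refresh GPU memory and utilization every 2 seconds.
watch -n 2 nvidia-smi

# Print a compact summary of GPU memory and utilization.
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv

# See how much disk the ComfyUI model folders are using.
du -sh /workspace/ComfyUI/models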

You can upgrade or restart containers on RunPod via the dashboard if resource needs change.

Optimizing GPU Usage for Concurrent Model Execution

Running multiple models like Ollama, Whisper, and ComfyUI in the same environment can lead to GPU contention if not managed carefully. Use strategies such as batching inference requests, limiting parallel processing threads, or deploying lightweight quantized models when possible. RunPod offers templates designed for AI workloads—check out their GPU Templates for a quick start that aligns with your performance needs. Ensuring that each model gets adequate GPU access without starving others will significantly enhance the responsiveness and stability of your pipeline.
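As a concrete starting point, you can cap how much each service tries to hold in VRAM at once. The sketch below uses Ollama's OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL environment variables and ComfyUI's --lowvram flag, which exist in current releases; treat the exact values as assumptions to tune for your GPU:

bash
# Limit Ollama to one loaded model and one request at a time.
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1
ollama serve &

# Run ComfyUI in low-VRAM mode so it releases memory more aggressively.
python3 /workspace/ComfyUI/main.py --listen 0.0.0.0 --port 8188 --lowvram &

# Use a smaller Whisper model when transcription quality allows it.
whisper input.mp3 --model small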

Automating Deployment with RunPod API

To streamline your multi-model environment setup, consider automating deployments with the RunPod API (see the API Docs). You can programmatically launch containers, allocate GPUs, manage volumes, and even configure environment variables on the fly. This is particularly useful when building scalable inference pipelines or integrating RunPod into a CI/CD workflow. Automating your setup not only saves time but also reduces human error, ensuring that each deployment is consistent and production-ready.
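As a rough sketch, a single GraphQL call can create a pod. The mutation name (podFindAndDeployOnDemand) comes from RunPod's documented examples, but double-check the exact input fields against the current API Docs before relying on this; the API key, GPU type, and image name below are placeholders:

bash
# Launch an on-demand pod via RunPod's GraphQL API (verify field names against the API Docs).
curl -s "https://api.runpod.io/graphql?api_key=<YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { podFindAndDeployOnDemand(input: { cloudType: ALL, gpuCount: 1, gpuTypeId: \"NVIDIA A10G\", name: \"multimodal-stack\", imageName: \"<your-image>\", containerDiskInGb: 60, ports: \"11434/http,8188/http\" }) { id machineId } }"}'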

Conclusion

By deploying Ollama, Whisper, and ComfyUI in a single RunPod container, you gain the ability to process text, audio, and images in one place. This setup is perfect for multimodal experimentation, prototype development, or even scalable deployment using APIs.

Ready to get started?

Sign up for RunPod to launch your AI container, inference pipeline, or notebook with GPU support today.

Frequently Asked Questions (FAQs)

What are the pricing tiers on RunPod?

RunPod offers both on-demand and spot GPU pricing. Spot pricing can be up to 60% cheaper. You can view the full list of GPU pricing tiers here.

Are there limits on how many containers I can run?

Yes. Free accounts may have limited GPU hours. Paid tiers unlock more instances and persistent volumes. Refer to RunPod container limits for detailed quotas.

What GPUs are available on RunPod?

You’ll find GPUs such as A10G, RTX 4090, A100, and others. Availability depends on region and provider. Explore options on the pricing page.

Can I use other models with Ollama or Whisper?

Absolutely. Ollama supports custom models (LLaMA, Mistral, etc.) via Modelfiles that point to local weight files (such as GGUF) or by pulling from its model registry. Whisper models range from tiny to large-v2. See the Whisper GitHub repo for details.

How do I structure my Dockerfile properly?

Best practices include:

  • Installing all dependencies early.
  • Avoiding unnecessary layers.
  • Exposing all needed ports.
  • Managing long-running services via supervisord or a shell command that backgrounds multiple processes.

Check RunPod’s Dockerfile guide for more.

Can I access all services via public IP?

Yes, but you must expose their ports in RunPod settings. Use reverse proxies for production environments. For private access, RunPod also supports SSH tunneling.

Where can I find working examples?

RunPod's model deployment examples include popular setups for LLMs, SD pipelines, and audio transcription workflows.
