If you've been using Claude Code with Anthropic's hosted models, you already know how powerful it is for AI-assisted development. But what if you could run the same workflow for a fraction of the cost, with complete control over the underlying model? In this guide, we'll walk you through connecting Claude Code to a self-hosted model running on RunPod using Ollama, with no Anthropic API key required.

Before diving into the setup, it's worth understanding why you'd want to do this in the first place. There are four compelling reasons:
Cost. Ten dollars goes significantly further when you're self-hosting. In this guide, we use a 20B coding model quantized to 4-bit, which runs comfortably on an A4500 at just $0.25/hour — giving you nearly 40 hours of unlimited use for what you might spend in an hour or two with a larger Claude model if you're not careful.
Right-sizing your model to the task. If you're generating boilerplate Python scripts or simple utilities, you don't need Opus — or even Haiku. Practically any competent coding model can one-shot those tasks. Paying per-token rates for a frontier model on simple work is overkill, and self-hosting lets you tune your spend to match the complexity of what you're building.
Compliance and security. If your work involves trade secrets, sensitive data, or specific security requirements around tool calling and OS-level access, large hosted foundational models may not meet your needs. When you bring your own model, you're connecting Claude Code to an LLM engine under your direct control — one you can inspect, configure, and extend as needed.
Domain-specific fine-tuning. You can swap in models fine-tuned for specific domains: a model trained heavily on Python, one optimized for data science, or any other specialized variant. This matters especially with smaller models, which benefit greatly from fine-tuning since they lack the broad general knowledge of larger frontier models.
In the RunPod dashboard, start deploying a new pod: scroll to the A4500 GPU (currently around $0.25/hour) and select the Ollama template. Give the container a bit of extra disk space in case you need it, then deploy.
While the pod boots, think about your model selection. This is important: if you want Claude Code's full tool-calling capabilities — where it edits files autonomously and takes real actions in your codebase — you need a model that explicitly supports tool calling. Not every open-source model does. For this guide, we're using a fine-tuned version of GPT-OSS-20B that has been adapted specifically for tool calling.
Once your pod is running, connect to it via the terminal and pull your model:
ollama run slekrem/gpt-oss-claude-code-32k
You can then test it with a quick "hello world" prompt in the terminal.
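Since tool calling is what makes this whole setup useful, you can also probe it directly against Ollama's /api/chat endpoint before involving Claude Code. This is only a rough check, and the get_weather tool below is a made-up placeholder; a model with working tool support should answer with a tool_calls entry rather than plain prose:
# Run on the Ollama pod; the model was pulled in the previous step.
curl -s http://localhost:11434/api/chat -d '{
  "model": "slekrem/gpt-oss-claude-code-32k",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"]
      }
    }
  }]
}'
If the reply comes back as ordinary text with no tool_calls field, the model probably won't drive Claude Code's autonomous file edits reliably.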
Spin up a second pod — an A6000 running the latest PyTorch template works well. This is the pod where you'll install and run Claude Code.
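Claude Code's standard install route is npm. If it isn't already on the image, a minimal sketch (assuming Node.js and npm are available; install them first if they aren't):
npm install -g @anthropic-ai/claude-code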
Once Claude Code is installed, also install a terminal text editor so you can edit its configuration:
apt-get update && apt-get install nano
Claude Code needs to know where to send its requests. Navigate to the Claude configuration directory and open settings.json:
cd ~/.claude
nano settings.json
Add the environment variables that point Claude Code at your Ollama pod. You'll need your Ollama pod's ID from the RunPod dashboard; paste it into the appropriate field in the config. The full settings snippet is shown below and is also available in the video description.
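It's also worth confirming that the Ollama pod is reachable from this second pod before pointing Claude Code at it. A quick check, assuming the default RunPod proxy URL pattern used in the example config below (swap in your own pod ID):
# Should return a JSON list that includes the model you pulled earlier.
curl -s https://yourpodidgoeshere-11434.proxy.runpod.net/api/tags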
If you don't have an active Anthropic account, you'll need to bypass the authentication screen. Create a small shell script that returns a dummy API key:
# Create api_key_helper.sh
echo '#!/bin/bash' >> api_key_helper.sh
echo 'echo "dummy-key"' >> api_key_helper.sh
chmod +x api_key_helper.sh
Then reference this script in your settings.json under the apiKeyHelper field, giving the full path to the file. When you launch Claude Code, it will skip the login screen entirely and connect directly to your Ollama pod.
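If you prefer, the same helper can be written in one step with a heredoc. This is just an alternative sketch, assuming you want the script stored next to settings.json in ~/.claude (the path used in the example config below):
cat > ~/.claude/api_key_helper.sh <<'EOF'
#!/bin/bash
echo "dummy-key"
EOF
chmod +x ~/.claude/api_key_helper.sh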
Here's an example settings.json that you can use:
{
  "apiKeyHelper": "/root/.claude/api_key_helper.sh",
  "env": {
    "ANTHROPIC_BASE_URL": "https://yourpodidgoeshere-11434.proxy.runpod.net",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "slekrem/gpt-oss-claude-code-32k:20b",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "slekrem/gpt-oss-claude-code-32k:20b",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "slekrem/gpt-oss-claude-code-32k:20b",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "slekrem/gpt-oss-claude-code-32k:20b"
  }
}
Launch Claude Code from your workspace directory and ask it a simple question:
Which model am I speaking to?
If everything is configured correctly, you'll see the model identify itself as your Ollama-hosted model — not Claude. You're now routing entirely through your own infrastructure.
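You can also confirm the routing from the Ollama pod's side: while Claude Code is generating a response, the model should show up as loaded and the GPU should be busy.
# Run these on the Ollama pod while Claude Code is answering.
ollama ps     # lists models currently loaded into memory
nvidia-smi    # GPU utilization should spike while a response streams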
We ran a few tests to see how a small quantized model holds up for real coding tasks.
Snake game — We asked the model to build a terminal-based Snake game with arrow key controls, apple collection, and score tracking. It one-shotted a working game on the first attempt, which is impressive for a 4-bit quantized 20B model.
Tetris — Same story. The model one-shotted a terminal Tetris game. When we added a follow-up request for rotation controls and better speed, it integrated those changes cleanly in a second pass.
Web search — The model correctly flagged that it doesn't have native web browsing capability. However, when given a direct URL, it was able to fetch and summarize the page — a useful workaround for targeted lookups even without a true search integration.
Open-ended architecture questions — This is where the limits showed. When asked to "choose the best framework for a REST API" with no additional context, the model got stuck — spending several minutes searching an empty codebase before eventually stalling out. Small models need more direction. They don't carry the same planning and reasoning depth as frontier models, so vague or open-ended prompts tend to produce poor results.
The bottom line: for well-defined coding tasks — generating scripts, building small applications, writing boilerplate — a self-hosted model on RunPod can match or exceed what you'd need from a hosted model at a tiny fraction of the cost. For complex, multi-step reasoning or ambiguous architecture decisions, you may still want to reach for a larger model.
The key to success with smaller models is the same best practice that applies to AI coding assistants generally: be specific. Break work into small, concrete tasks. The more granular your prompt, the better your results — regardless of which model you're using.
Ready to try it yourself? You'll need a RunPod account, one pod running the Ollama template to serve the model, a second pod for Claude Code itself, and an open-source model that supports tool calling.
If you need further help, check out our YouTube video on the topic.
If you build something cool with this setup, drop it in the comments on the video or let us know in the Discord. Happy building!