Hot starts, batch inference, and what's next for Runpod Serverless. Webinar June 25.

How to Run MoonshotAI’s Kimi-K2-Instruct on Runpod Instant Cluster

Run MoonshotAI's Kimi-K2-Instruct on Runpod Clusters using H200 SXM GPUs and a 2TB shared network volume for seamless multi-node training. This guide.

How to Run MoonshotAI’s Kimi-K2-Instruct on Runpod Instant Cluster

1. Create Network Storage (2TB), Use the CA-MTL-4 region (recommended for now).

2. Spin Up a Pod using Runpod official Pytorch template and mount the network volume you just created, once the pod is running, connect to Jupyter Lab

3. Download the Model

  1. In the Jupyter terminal, install the Hugging Face CLI and download the model:


4. Launch the Instant Cluster

  • Choose Standard Cluster
  • Select the network volume you created
  • Set Pod Count to 2 (default is 2)
  • Pick H200 SXM GPU type
  • Select Runpod Pytorch official template
  • Click Edit Template and set container disk size to be 2000 GB
  • Click Deploy Cluster to launch it.

1. installation on Node 0 with a shared volume

2. Node 1 Instructions

3. You should see the following:


4. Run on node with ip as host.

Runpod console Clusters view showing two 8x H200 pods with GPU utilization and memory metrics
Jupyter notebook cell testing the Kimi K2 chat completions API with a successful response

Open AI Implementation

Expected Output:

Known Issues

Currently as of July 21st, vllm library is not up to date so need to build from nightly builds.

https://github.com/MoonshotAI/Kimi-K2/issues/19

Gotchas

uv environment on a Network Volume is slow to initialize ray, recommend any python environments be ran on the machine itself instead of on the Network Volume.

Author profile: Brendan McKeag

Related articles

View All
Deploy When Available is now GA

Deploy When Available is now GA

Queue for any GPU spec, even one that's fully rented out, and we'll deploy it the moment capacity opens up. No more refreshing the console or running a sniping tool.

All
The Chips Got Faster. The Stack Didn't.

The Chips Got Faster. The Stack Didn't.

Explore why faster chips have shifted the bottleneck to AI infrastructure, and what that means for teams running production workloads.

All

Build what’s next.

Build, train, and scale AI workloads on Runpod with cloud GPUs, Serverless, and Clusters.