Moonshot AI's Kimi-K2-Instruct is a trillion-parameter, open-source mixture-of-experts LLM optimized for autonomous agentic tasks. With 32 billion active parameters and Muon-trained performance that rivals proprietary models (89.5% MMLU, 97.4% MATH-500, 65.8% pass@1 on SWE-bench Verified), it can run inference on as little as 1 TB of VRAM using 8-bit quantization.

Moonshot AI's release of Kimi-K2-Instruct represents a watershed moment in open-source artificial intelligence. This state-of-the-art mixture-of-experts language model combines 32 billion activated parameters with 1 trillion total parameters, trained using the innovative Muon optimizer to achieve exceptional performance across frontier knowledge, reasoning, and coding tasks. What sets Kimi K2 apart is its meticulous optimization for agentic capabilities, making it particularly powerful for autonomous problem-solving and tool-use scenarios. It is also the largest open-source large language model by a country mile, with roughly 1.5x the parameters of DeepSeek's 671B models. (Llama 4 Behemoth is slated to be 2 trillion parameters, but there has been no further news on it since April.)

K2's benchmark results speak to its remarkable capabilities. Kimi K2 achieves 89.5% on MMLU, 97.4% on MATH-500, and an impressive 65.8% pass@1 on SWE-bench Verified tests with bash/editor tools. These numbers position it among the most capable open-source models available, competing directly with proprietary alternatives while offering the freedom and control that comes with local deployment.
Within the current open-source ecosystem, Kimi K2 can be considered a well-rounded upgrade over DeepSeek V3-0324 and R1-0528, while standing as a reasonable competitor to the recent crop of closed-source large models (Grok 4, Claude Sonnet/Opus 4, etc.).
Because this is such a large model, there's some enormous heavy lifting required to get it off the ground. You'll need 2 terabytes of VRAM to load the model at full weights, necessitating one of our Instant Clusters. Fortunately, 8-bit precision cuts that requirement down to 1 terabyte, putting it within reach of a single 8xH200 or 6xB200 pod. If you'd like to test out the model without a lot of technical investment, this drastically reduces the overhead involved, and you can always upgrade to a cluster later if you find you need the full 16-bit precision. Due to some technical considerations specific to the model, our official vLLM templates will need to be updated for full compatibility, but you can simply run a PyTorch pod and install vLLM manually in the meantime to begin using this landmark model immediately.
You'll also need 1.2TB of disk volume space to hold the model. If you're planning on running an API server, you'll also want to expose port 8000 so the server can receive incoming requests.

We'll be using a Jupyter Notebook to demonstrate the code and dependencies required to begin running inference.
Run the following in the terminal to install requirements:
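The exact package list below is illustrative rather than official; it covers the steps in this guide (Kimi K2's tokenizer needs `blobfile` and `tiktoken`, and `ray` backs the multi-GPU executor used later):

```bash
# Illustrative requirements; pin versions to match your pod image and the current vLLM release notes
pip install -U vllm ray blobfile tiktoken
pip install -U "huggingface_hub[hf_transfer]" openai
```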
Let’s configure environment variables for multi-GPU inference and define our model:
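Here's a minimal sketch of that setup cell, assuming an 8-GPU pod; the variable names (`MODEL_NAME`, `CACHE_DIR`) are just conventions reused through the rest of this guide:

```python
import os

# Use all eight GPUs in the pod (adjust for your hardware)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
# Speed up HuggingFace downloads and keep multiprocessing Jupyter-friendly
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

MODEL_NAME = "moonshotai/Kimi-K2-Instruct"
CACHE_DIR = "./model_cache"
```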
Next, we’ll download the model. Budget 30 minutes to an hour to download the weights.
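One way to do this is with `huggingface_hub`'s `snapshot_download`, reusing the names defined above:

```python
from huggingface_hub import snapshot_download

# Pull the full checkpoint (~1 TB at 8-bit) into ./model_cache using the standard HF cache layout
snapshot_download(repo_id=MODEL_NAME, cache_dir=CACHE_DIR)
```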
Now we'll configure vLLM for optimal performance with our hardware:
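The settings below are a working illustration rather than an official recipe; they target the 8xH200, 8-bit scenario discussed here, and the two notes that follow explain the context-length and executor choices:

```python
# Engine arguments for vLLM; tune these for your own pod
engine_kwargs = dict(
    model=MODEL_NAME,
    download_dir=CACHE_DIR,              # reuse the weights downloaded above
    trust_remote_code=True,              # Kimi K2 ships a custom tokenizer
    tensor_parallel_size=8,              # shard the model across all 8 GPUs
    distributed_executor_backend="ray",  # see the note below
    max_model_len=16384,                 # ~16K context fits on 8xH200 at 8 bits
    gpu_memory_utilization=0.95,
)
```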
On an 8xH200 pod you may only be able to squeeze about 16K of context into memory at 8 bits. If you require more space, you'll want a 7x or 8xB200 pod instead. Once lower-bit quantizations are available, they will offer more flexibility in trading memory use against inference quality.
Important: Based on the official Moonshot AI deployment guide, Ray executor is recommended for Kimi-K2-Instruct multi-GPU inference.
Load the model with our optimized configuration:
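Assuming the `engine_kwargs` sketched above, loading is a one-liner (expect it to take several minutes for a checkpoint this large):

```python
from vllm import LLM

llm = LLM(**engine_kwargs)
```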
Finally, let’s test the model with some prompts.
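A quick smoke test might look like the following; the prompt and sampling settings are arbitrary:

```python
from vllm import SamplingParams

sampling = SamplingParams(temperature=0.6, max_tokens=512)

messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
    {"role": "user", "content": "Explain mixture-of-experts models in two paragraphs."},
]

# llm.chat() applies the model's chat template before generating
outputs = llm.chat(messages, sampling_params=sampling)
print(outputs[0].outputs[0].text)
```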
As you can see, we're getting respectable tokens-per-second throughput thanks to K2's Mixture-of-Experts (MoE) architecture. Even at a staggering 1T total parameters, we're still running at about 40-50 tokens/second on the H200s; the B200s are likely to be even faster.
**Setting Up Production API Serving**
Starting the API server builds directly on our existing configuration, requiring only minor modifications to account for the serving context. The model path follows HuggingFace's standard cache structure. When we downloaded the model to `./model_cache`, HuggingFace automatically organized it as `./model_cache/models--moonshotai--Kimi-K2-Instruct/snapshots/[hash]/` where the hash represents the specific model version. Using the wildcard `*` in the path ensures the command works regardless of the specific snapshot hash.
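A launch command consistent with that layout might look like this; the flags mirror the notebook configuration above and are illustrative rather than definitive:

```bash
# Serve an OpenAI-compatible API on port 8000; the * glob resolves the snapshot hash
vllm serve ./model_cache/models--moonshotai--Kimi-K2-Instruct/snapshots/*/ \
    --served-model-name moonshotai/Kimi-K2-Instruct \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray \
    --max-model-len 16384 \
    --port 8000
```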
Once the API server is running, applications can interact with Kimi-K2-Instruct using standard OpenAI client requests, making integration seamless for existing applications. The compatibility extends beyond basic chat completion to include streaming responses, function calling, and custom parameters, providing a comprehensive API surface that matches commercial offerings.
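For example, a minimal request with the standard `openai` Python client might look like this (the `api_key` value is a placeholder, since the local server doesn't require one unless you configure it):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI."}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```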
What began as a technical challenge—fitting 1 trillion parameters across 8 GPUs—evolved into a comprehensive solution that rivals even the most recent commercial AI services while maintaining complete organizational control. The successful deployment of Kimi-K2-Instruct locally represents more than just a technical achievement; it signifies a fundamental shift toward AI democratization and infrastructure independence.
There is still room to grow, as well. If you need the full 128k of context that this model offers, or if you need to run it at full weights, consider one of our Instant Clusters to equip yourself for this undertaking.
The most cost-effective platform for building, training, and scaling machine learning models—ready when you are.