Kimi K1.5: A New Way to Scale Model Reasoning

M Rousol
December 24, 2025


Artificial intelligence is shifting from "more data" to "more computation at inference time", and Moonshot AI's release of Kimi K1.5 is a big step forward in this change. Unlike traditional Large Language Models (LLMs), which often rely on pattern matching, Kimi K1.5 uses advanced reinforcement learning (RL) to "think" before it speaks, which makes it better at solving difficult logical, mathematical, and coding problems.

The Main Idea: Scaling Law 2.0

In the past, scaling laws said that models got smarter with more parameters and more training tokens. Kimi K1.5 puts the emphasis on inference-time scaling instead: the model has learnt to spend more compute thinking about a problem and to explore different lines of reasoning before drawing a conclusion.

Important new features of Kimi K1.5

Reinforcement Learning on a Large Scale: Kimi K1.5 was trained with a large-scale RL framework rather than regular fine-tuning. The model learns by trying things out and being rewarded for reaching the right answer through a logical "Chain of Thought" (CoT).

Multi-Modal Integration: Kimi K1.5 is one of the first models to apply these large-scale RL reasoning advances to both text and vision, letting it solve hard visual puzzles or read technical diagrams with a high degree of accuracy.


Long-Context Reasoning: Building on Moonshot AI's track record with long-context windows, K1.5 keeps its logical consistency even when processing hundreds of thousands of tokens.

How it Works: The RL Architecture

Kimi K1.5 uses an approach that is often compared to OpenAI's o1. It focuses on System 2 thinking: the slow, deliberate, logical mode of human reasoning.


1. The Policy and Reward System

The model uses a policy network to make steps in its reasoning. A reward model (or a verifiable environment like a code compiler or math checker) supplies feedback during training. If the model makes a mistake in logic, it is punished; if it fixes itself and finds the right answer, it is rewarded.
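To make the loop concrete, here is a minimal sketch of what one verifiable-reward training step could look like in Python. The names (generate_cot, verify, update_policy) are illustrative placeholders, not Moonshot AI's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Minimal sketch of one verifiable-reward RL step (illustrative placeholders,
# not Moonshot AI's actual implementation).

@dataclass
class Problem:
    prompt: str
    expected_answer: str

def rl_step(
    generate_cot: Callable[[str], Tuple[str, str]],            # prompt -> (chain of thought, answer)
    verify: Callable[[Problem, str], bool],                     # deterministic checker
    update_policy: Callable[[List[Tuple[str, str, float]]], None],
    batch: List[Problem],
) -> float:
    """Roll out reasoning, grade it with a verifier, and update the policy."""
    trajectories, total_reward = [], 0.0
    for problem in batch:
        chain_of_thought, answer = generate_cot(problem.prompt)
        reward = 1.0 if verify(problem, answer) else 0.0        # objective, binary signal
        total_reward += reward
        trajectories.append((problem.prompt, chain_of_thought, reward))
    update_policy(trajectories)                                 # e.g. a policy-gradient update
    return total_reward / len(batch)
```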



2. The Chain of Thought (CoT)

Kimi K1.5 doesn't commit to an answer right away. It generates an internal dialogue that works through the problem step by step before producing a final response.




3. Search and Verification

The model uses "search" to guide its decisions (a toy sketch follows the list below). This makes it possible for it to:

Self-Correct: Find mistakes in its own logic while it's still going.

Backtrack: Give up on a line of reasoning that isn't going anywhere and try a new one.

Check: Use outside tools or internal logic checks to make sure the final output is sound.
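As a toy illustration (not Moonshot AI's actual algorithm), search with self-correction and backtracking can be organised like this: propose a step, check it, and abandon the branch if the check fails.

```python
from typing import Callable, List, Optional

# Toy search-with-backtracking over reasoning steps (illustrative only).
# propose_steps: suggests candidate next steps for a partial solution.
# step_is_valid: a cheap logic check that can reject a step early.
# is_solved:     a final verifier (e.g. an answer checker).

def solve(
    state: List[str],
    propose_steps: Callable[[List[str]], List[str]],
    step_is_valid: Callable[[List[str], str], bool],
    is_solved: Callable[[List[str]], bool],
    depth_limit: int = 8,
) -> Optional[List[str]]:
    if is_solved(state):                      # Check: final verification passed
        return state
    if depth_limit == 0:
        return None
    for step in propose_steps(state):
        if not step_is_valid(state, step):    # Self-correct: reject a flawed step
            continue
        result = solve(state + [step], propose_steps, step_is_valid,
                       is_solved, depth_limit - 1)
        if result is not None:
            return result
        # Backtrack: this line of reasoning went nowhere, try the next candidate
    return None
```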


Performance Benchmarks

Kimi K1.5 has shown remarkable parity with (and in some cases, superiority over) global leaders like GPT-4o and o1-mini in specific domains:


 <table border="1">
  <thead>
   <tr>
    <th>Benchmark</th>
    <th>Domain</th>
    <th>Kimi K1.5 Performance</th>
   </tr>
  </thead>
  <tbody>
   <tr>
    <td>MATH</td>
    <td>Advanced Mathematics</td>
    <td>Near-state-of-the-art accuracy</td>
   </tr>
   <tr>
    <td>AIME</td>
    <td>Math Olympiad Level</td>
    <td>High success rate via long reasoning</td>
   </tr>
   <tr>
    <td>HumanEval</td>
    <td>Coding & Python</td>
    <td>Exceptional logic and debugging</td>
   </tr>
   <tr>
    <td>MMMU</td>
    <td>Multi-modal Reasoning</td>
    <td>Strong visual-logical reasoning</td>
   </tr>
  </tbody>
 </table>




Why This Is Important

Kimi K1.5's success shows that reinforcement learning is the main engine for the next generation of AI. Models improve not just by imitating human text but by discovering logical strategies through RL. This makes them:

• More Reliable: Less likely to "hallucinate" in maths and logic.

• Better Problem Solvers: Able to handle science and engineering problems at the PhD level.







• More Efficient: Getting smarter without needing trillions of extra parameters.


Kimi K1.5 is more than just a chatbot; it's a reasoning engine that shows us a future where AI can think, check, and solve problems with the same level of rigour that only human experts used to have.

RLVR: Reinforcement Learning from Verifiable Rewards

Most LLMs use RLHF (RL from Human Feedback), where a reward model tries to predict what a person would prefer. Kimi K1.5 uses RLVR, which replaces "human taste" with "objective truth".

• The Verifier: K1.5 uses a deterministic verifier for math and code, such as a Python compiler or a symbolic math engine.

• The Reward: If the model gets the right answer, it receives a reward of 1; if it gets the wrong answer, it receives 0.

• The Result: This prevents "reward hacking", where a model learns to sound pleasing to the grader and score highly even when it is wrong. The model must produce a real Chain of Thought (CoT), because that is the only reliable way to reach the verifiable answer. (A minimal sketch of such a verifier follows.)
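As a concrete, hypothetical example of such a verifier, a math answer can be graded deterministically by comparing it against a reference answer:

```python
# Sketch of a deterministic "verifier" for math answers (illustrative, not
# Moonshot AI's implementation): the reward is 1.0 only if the model's final
# answer matches the reference answer after normalisation.

def math_reward(model_answer: str, reference_answer: str, tol: float = 1e-6) -> float:
    try:
        # Numeric answers: compare within a small tolerance.
        return 1.0 if abs(float(model_answer) - float(reference_answer)) <= tol else 0.0
    except ValueError:
        # Symbolic / text answers: fall back to a normalised string match.
        norm = lambda s: s.strip().lower().replace(" ", "")
        return 1.0 if norm(model_answer) == norm(reference_answer) else 0.0

print(math_reward("3.14159", "3.14159"))  # 1.0
print(math_reward("x+1", "x + 1"))        # 1.0
print(math_reward("42", "41"))            # 0.0
```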

Online Mirror Descent (OMD) for Policy Optimisation

Many models use PPO (Proximal Policy Optimisation), but Kimi K1.5 uses a variant of Online Mirror Descent. This is a more stable way to update the model's "policy", the distribution it uses to pick the next token.

PPO vs. OMD in Kimi K1.5

PPO (The Standard): Uses a "critic" model to estimate how much each state is worth. This costs a lot of compute and is often unstable for very long reasoning chains.

Kimi's Choice (Online Mirror Descent): Uses relative entropy regularisation. It treats learning as a journey through "probability space", which keeps the model's new reasoning strategy from straying too far from its stable base even as it learns new, more complicated logic.
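In generic form (a sketch of this family of objectives, not the exact formula from Moonshot AI's report), a relative-entropy-regularised update maximises expected reward while penalising drift from the previous policy:

```latex
% Generic relative-entropy (KL) regularised RL objective: maximise the
% verifiable reward while keeping the new policy close to the previous one.
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]
  \;-\; \tau \,\mathrm{KL}\!\left( \pi(\cdot \mid x) \,\middle\|\, \pi_{\text{old}}(\cdot \mid x) \right)
\right]
```

Here r(x, y) is the verifiable reward, π_old is the policy before the update, and τ sets how far the policy may drift in a single step; this is the relative entropy regularisation the text refers to.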

Compression Phase:

The second round of RL uses a brevity penalty. The model receives points for finding the shortest path of reasoning that still leads to the right, verifiable answer.

Important Formula:

The reward function usually looks like this:

R = R_verify − α · Length

where α is a small penalty coefficient that discourages the model from "yapping" (padding its chain of thought) without sacrificing correctness.
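A tiny illustration of how such a reward could be computed (the α value below is purely illustrative):

```python
# Length-penalised reward: correct-and-short beats correct-and-verbose.
# The alpha value is purely illustrative.

def compressed_reward(is_correct: bool, num_reasoning_tokens: int,
                      alpha: float = 1e-4) -> float:
    r_verify = 1.0 if is_correct else 0.0
    return r_verify - alpha * num_reasoning_tokens

print(compressed_reward(True, 2_000))   #  0.80  (correct and concise)
print(compressed_reward(True, 9_000))   #  0.10  (correct but long-winded)
print(compressed_reward(False, 500))    # -0.05  (wrong answers stay unrewarded)
```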






Why not DPO?

DPO (Direct Preference Optimisation) is popular because it is "offline": you don't have to generate new samples from the model during training. But Kimi K1.5 doesn't use pure DPO for reasoning, because DPO is limited by its dataset: it can only learn from "better vs. worse" pairs that already exist.

Kimi's OMD-based RL, by contrast, is active: it lets the model explore. It can try a reasoning path that no one has ever written down, discover that it works, and "level up" its own intelligence.

Summary Table: Comparing Algorithms


<table border="1">
 <thead>
  <tr>
   <th>Feature</th>
   <th>Standard LLM (DPO)</th>
   <th>Kimi K1.5 (RLVR + OMD)</th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>Learning Source</td>
   <td>Static human rankings</td>
   <td>Dynamic "Verifiers" (Code/Math)</td>
  </tr>
  <tr>
   <td>Stability</td>
   <td>High (Offline)</td>
   <td>High (Mirror Descent)</td>
  </tr>
  <tr>
   <td>Self-Correction</td>
   <td>Limited</td>
   <td>High (Learned via exploration)</td>
  </tr>
  <tr>
   <td>Optimization</td>
   <td>Binary Classification</td>
   <td>Iterative Policy Refinement</td>
  </tr>
 </tbody>
</table>


The "rollout" phase of training a large language model is the most expensive part. This stage is where the model actually writes text (the "thinking" steps) before being graded. A single training batch for a model like Kimi K1.5, which can reason for up to 128,000 tokens, could be too big for even the most powerful GPUs (like the H100).

To fix this, Moonshot AI uses partial rollouts, an infrastructure trick that breaks the "all-or-nothing" generation cycle.

The "Long-Tail" Problem

In standard reinforcement learning, the trainer sends a batch of prompts to the GPUs. The system cannot compute rewards and update the model until the slowest, longest response has finished.

• The Waste: If 31 prompts in a batch only need 500 tokens while one runs much longer, 31 GPUs sit around doing nothing while they wait for that single long "tail" to finish. (A quick utilisation estimate follows the list.)

• The Crash: Generating and storing 128k tokens for an entire batch at once hits a "memory wall", causing Out-of-Memory (OOM) errors.
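A quick back-of-the-envelope calculation with illustrative numbers shows how badly a single long tail hurts utilisation (assuming generation time is roughly proportional to tokens generated):

```python
# Back-of-the-envelope GPU utilisation with one long "tail" (illustrative numbers).
# 32 prompts on 32 GPUs: 31 finish after 500 tokens, one runs to 100,000 tokens.

short_tokens, long_tokens, n_short = 500, 100_000, 31
batch_time = long_tokens                      # the batch is gated by the slowest rollout
useful_work = n_short * short_tokens + long_tokens
total_capacity = (n_short + 1) * batch_time   # what 32 GPUs could have generated

print(f"Utilisation: {useful_work / total_capacity:.1%}")   # ≈ 3.6%
```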

How Partial Rollouts Work

Kimi K1.5 sees reasoning as a series of steps instead of one long stream.

1. The Token Budget (Capping)

The system doesn't let the model run until it emits a "stop" token. Instead, it sets a fixed token budget for each iteration; for instance, it might only generate 4,000 tokens per pass.



2. Cutting Up and Saving

• If the model reaches the 4,000-token limit halfway through a thought, generation simply stops.

• The current state, which is the "Chain of Thought" so far, is stored in a Replay Buffer.

• The GPU is free to work on another prompt right away.

3. Asynchronous Resumption

In the next training iteration, the system checks the replay buffer and picks up the "unfinished" reasoning right where it left off. Because the workers run asynchronously, the system can mix and match (see the sketch after this list):

New Tasks: Making short responses from scratch.

Buffered Tasks: Continuing long-context reasoning segments from the last step.
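Put together, the mechanism could look roughly like the following simplified sketch; it is a hypothetical scheduler, not Moonshot AI's actual system. Each pass generates at most token_budget tokens per trace, unfinished traces go back into the replay buffer, and finished ones are handed off for grading.

```python
from collections import deque
from typing import Callable, List, Tuple

# Simplified sketch of partial rollouts (illustrative, not Moonshot AI's code).
# generate_segment(prompt, prefix, budget) returns (new_text, finished_flag).

def partial_rollout_pass(
    new_prompts: List[str],
    buffer: "deque[Tuple[str, str]]",               # (prompt, reasoning generated so far)
    generate_segment: Callable[[str, str, int], Tuple[str, bool]],
    token_budget: int = 4_000,
) -> List[Tuple[str, str]]:
    finished = []
    # Mix buffered (long, resumed) work with fresh prompts.
    work_items = list(buffer) + [(p, "") for p in new_prompts]
    buffer.clear()
    for prompt, prefix in work_items:
        segment, done = generate_segment(prompt, prefix, token_budget)
        trace = prefix + segment
        if done:
            finished.append((prompt, trace))        # ready to be graded by the verifier
        else:
            buffer.append((prompt, trace))          # pause and resume next iteration
    return finished
```

The caller would keep a single collections.deque alive across training iterations so that unfinished reasoning survives from one pass to the next.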

Benefits of this Architecture:

GPU Efficiency: Kimi K1.5 maximises throughput by eliminating the "idle wait" for long-tail responses. The GPUs are always doing maths, not waiting.

Scaling up to 128k: Because it only processes one segment of a 128k-token trace at a time for loss computation, the system can train on ultra-long reasoning paths that would be physically impossible to fit into GPU memory in one go.

Repeat Detection: If the system sees that the model is stuck in a loop (for example, making the same logical mistake over and over), it can trigger "early termination", which saves a lot of compute. A crude heuristic for this is sketched below.
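One crude heuristic for spotting such loops (illustrative only; Moonshot AI does not publish this exact check) is to see whether the tail of the chain of thought keeps repeating the same n-gram:

```python
# Crude loop detector: terminate early if the last n-gram of the reasoning
# has already appeared several times (heuristic, illustrative only).

def is_stuck_in_loop(tokens: list, ngram: int = 20, max_repeats: int = 3) -> bool:
    if len(tokens) < ngram * (max_repeats + 1):
        return False
    tail = tuple(tokens[-ngram:])
    repeats = sum(
        1 for i in range(len(tokens) - ngram)
        if tuple(tokens[i:i + ngram]) == tail
    )
    return repeats >= max_repeats

print(is_stuck_in_loop(["so", "x", "=", "2"] * 30, ngram=4))  # True: same 4-gram repeats
print(is_stuck_in_loop(list("abcdefghijklmnopqrstuvwxyz")))   # False
```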

Summary: Standard RL vs. Kimi’s Partial Rollouts


<table border="1">
 <thead>
  <tr>
   <th>Feature</th>
   <th>Standard RL Rollout</th>
   <th>Kimi K1.5 Partial Rollout</th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>Generation</td>
   <td>Must finish the whole sequence</td>
   <td>Can be paused and resumed</td>
  </tr>
  <tr>
   <td>GPU Usage</td>
   <td>High idle time (waiting for long tails)</td>
   <td>Near 100% utilization (asynchronous)</td>
  </tr>
  <tr>
   <td>Memory Limit</td>
   <td>Limited by the longest full sequence (OOM risk at 128k)</td>
   <td>Handles 128k by processing one segment at a time</td>
  </tr>
 </tbody>
</table>


 

Pro Tip: This method makes LLM training a "streamed" process, like how video platforms buffer small parts of a movie instead of making you download the whole 4K file before you start watching.


Written by M Rousol

Senior Editor at AIUPDATE. Passionate about uncovering the stories that shape our world. Follow along for deep dives into technology, culture, and design.
