Moonshot AI's Kimi K2.5 is the base model behind Cursor Composer 2 and Composer 2.5 — a 1-trillion-parameter, 32B-active Mixture-of-Experts trained on 15.5 trillion tokens under a Modified MIT license with an attribution threshold that Cursor almost certainly crossed in late 2025.
Most coverage of the Composer 2 / K2.5 story fixated on the 24-hour denial in March 2026 and Moonshot's public embarrassment. The more consequential angle is technical and economic: the K2.5 base model is the reason Cursor can price Composer 2.5 at $0.50 per million input tokens — roughly one-tenth of Claude Opus 4.7's rate — and still deliver frontier-class agentic coding output. The open-weight cost floor is structural, not accidental.
This guide covers the K2.5 architecture in detail, the MuonClip training innovation that enabled crash-free 15.5T-token training, Cursor's Lee Robinson compute-split disclosure (the first public number quantifying the open-weight-base + proprietary-RL pattern), and the Modified MIT attribution question that remains publicly unresolved. For a broader look at the K2 model family, see our Kimi K2.5 Agent Swarm guide and the K2-0905 September 2025 checkpoint deep-dive.
- 01 Composer 2.5 ships today on the same K2.5 checkpoint as Composer 2. Cursor confirmed in the May 18 launch blog: 'Composer 2.5 is built on the same open-source checkpoint as Composer 2, Moonshot’s Kimi K2.5.' This post's anchor is that launch.
- 02 K2's architecture is the most aggressively sparse MoE in the cohort. 1T total / 32B active = 3.2% activation rate. DeepSeek-V3 activates 5.5%, Qwen3-235B 9.4%, Mixtral 8x22B 27.7%. That sparsity is the cost-floor mechanism.
- 03 MuonClip is the under-cited training breakthrough that made K2 possible. Moonshot's Muon + QK-Clip combination enabled crash-free training of a trillion-parameter model on 15.5T tokens. No other published open-weight MoE at this scale has reported a comparably crash-free run.
- 04 The 1/4–3/4 compute split is the first public number for the open-base pattern. Lee Robinson's March 20, 2026 X-post: 'Only ~1/4 of the compute spent on the final model came from the base, the rest is from our training.' That ratio reframes the entire industry's economics.
- 05 The Modified MIT attribution clause is the unresolved compliance question. Cursor reportedly crossed $1B ARR in November 2025, or roughly $83M per month, well above the $20M monthly revenue threshold that triggers the 'Kimi K2.5' UI attribution requirement. Whether that attribution appears in the Cursor UI remains publicly unconfirmed.
01 - WHY NOW
Composer 2.5 shipped today on K2.5: the base model deserves its own examination.
Cursor's Composer 2.5 launch blog, published earlier today (May 18, 2026), was notable for something the March announcement lacked: transparency. The Composer 2 launch initially described the model as "in-house", and within 24 hours developer @fynnso had surfaced a Kimi model identifier in the API responses. Cursor's 2.5 copy instead opened with a clear attribution: "Composer 2.5 is built on the same open-source checkpoint as Composer 2, Moonshot's Kimi K2.5."
The lesson from the March incident apparently landed. But the base model itself — its architecture, training innovations, and the economics that make it the preferred foundation for a $1B+ ARR coding product — still lacks a dedicated analysis. That is what this guide provides.
For the Composer 2.5 product story specifically, see our Composer 2.5 launch guide. For context on how Cursor's earlier Composer 1.5 introduced the “RL greater than pretraining” disclosure, see our Composer 1.5 deep-dive.
02 - ARCHITECTURE
1T total, 32B active, 384 experts: 61 layers of MLA + SwiGLU.
Kimi K2 is a Mixture-of-Experts transformer with 1 trillion total parameters and 32 billion activated per forward pass. The architecture is documented in the HuggingFace model card and the official GitHub README. K2.5 follows the same architecture family — Moonshot has not published a separate K2.5 technical report as of this writing; the quantized AMD and NVIDIA variants on HuggingFace confirm the same MoE shape.
The depth profile is 61 layers total: 1 dense layer followed by 60 MoE layers. Attention uses Multi-Head Latent Attention (MLA) with 64 heads and a 7,168-dimensional hidden space. Each MoE expert has a 2,048-dimensional hidden dimension with SwiGLU activation. Vocabulary: 160K tokens. Context window: 128K. Model format: Block-fp8.
The expert routing is the critical architectural detail for understanding the economics. Of 384 experts per MoE layer, 8 are selected per token via learned routing plus 1 shared expert that participates in every forward pass. That 8+1-of-384 configuration means only 9 experts activate per token, producing the 3.2% activation rate examined in the comparison table in Section 06. The Kimi K2 technical report (arXiv:2507.20534) is the peer-citable anchor for all architecture figures.
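The routing step described above can be sketched in a few lines. This is a minimal illustration, not Moonshot's implementation: the random router logits, the `route_token` helper, and the softmax over the selected logits are all assumptions made for the example. Only the 384 / 8 / 1 counts come from the model card.

```python
import math
import random

NUM_ROUTED_EXPERTS = 384  # routed experts per MoE layer (from the model card)
TOP_K = 8                 # routed experts selected per token
NUM_SHARED_EXPERTS = 1    # shared expert that runs for every token


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k routed experts for one token and normalize their gates.

    Softmax over the selected logits is an illustrative assumption, not
    necessarily the gating function Kimi K2 uses.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(idx, e / total) for idx, e in zip(chosen, exps)]


# Hypothetical router output for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
selected = route_token(logits)

experts_run = len(selected) + NUM_SHARED_EXPERTS
routed_fraction = TOP_K / NUM_ROUTED_EXPERTS
print(f"experts run for this token: {experts_run}")
print(f"fraction of routed experts used: {routed_fraction:.2%}")
```

The routed-expert fraction printed here (8 of 384) is lower than the 3.2% parameter activation rate, which is expected: attention, the dense layer, and the shared expert contribute parameters that run for every token.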
Full MoE weight count
1 trillion total parameters across 61 layers (1 dense + 60 MoE). Block-fp8 format. Same architecture family across K2 and K2.5 checkpoints.
3.2% activation rate
Only 32B parameters activate per forward pass. 8 routed experts + 1 shared expert per MoE layer. Most aggressive sparsity in the frontier MoE cohort.
8 selected + 1 shared
384 total experts per MoE layer. 8 selected via learned routing per token, plus 1 shared expert always active. 64 attention heads, MLA attention mechanism.
Zero training instability
Pre-trained on 15.5 trillion tokens with zero loss spikes or training crashes. Made possible by the MuonClip optimizer (Section 03). 128K context window.
Two design decisions distinguish K2's architecture from earlier large-scale MoE systems. First, the extremely fine-grained expert count: 384 is 48× the 8 experts of Mixtral 8x22B and 1.5× the roughly 256 in DeepSeek-V3's reported configuration. More experts allow finer-grained specialization per forward pass, which is intended to yield higher output quality per activated FLOP. Second, the shared expert ensures a stable base representation for every token regardless of routing outcomes, reducing variance in outputs when routing decisions are uncertain.
The MLA attention mechanism, shared with the DeepSeek-V3 family, further reduces KV cache size at inference time by compressing key-value representations into a lower-dimensional latent space. Together, the fine-grained routing and MLA attention are the two structural reasons K2.5 can serve as a cost-floor inference backend for Cursor's Composer 2.5 at $0.50 per million input tokens.
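To make the KV-cache argument concrete, here is a rough per-token count of cached values under standard multi-head attention versus a latent-compressed cache. Only the layer and head counts come from this guide. The head dimension, latent width, and positional-key width are assumptions chosen for illustration, and the helper names are hypothetical.

```python
NUM_LAYERS = 61    # from the architecture description above
NUM_HEADS = 64     # from the architecture description above
HEAD_DIM = 128     # assumption for illustration
LATENT_DIM = 512   # assumption: width of the compressed key-value latent
ROPE_DIM = 64      # assumption: width of the separate positional key


def mha_cache_values_per_token(layers, heads, head_dim):
    """Standard attention caches full keys and values for every head."""
    return layers * 2 * heads * head_dim


def mla_cache_values_per_token(layers, latent_dim, rope_dim):
    """Latent attention caches one compressed vector plus a positional key."""
    return layers * (latent_dim + rope_dim)


mha = mha_cache_values_per_token(NUM_LAYERS, NUM_HEADS, HEAD_DIM)
mla = mla_cache_values_per_token(NUM_LAYERS, LATENT_DIM, ROPE_DIM)
print(f"standard attention: {mha:,} cached values per token")
print(f"latent attention:   {mla:,} cached values per token")
print(f"reduction under these assumptions: {mha / mla:.1f}x")
```

The exact ratio depends entirely on the assumed widths. The point of the sketch is only that the cache scales with the latent width rather than with heads times head dimension.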
03 - MUONCLIP
The training breakthrough nobody is citing.
The March 2026 attribution incident dominated coverage of K2.5. Lost in that story was the technical novelty that actually made K2 possible: a training optimizer called MuonClip, which Moonshot describes as "the Muon optimizer applied to an unprecedented scale, with novel optimization techniques to resolve instabilities while scaling up."
MuonClip is Moonshot's K2-specific extension of the Muon optimizer. Muon is a momentum-based optimizer that orthogonalizes each weight-matrix update; the variant Moonshot scaled up adds weight decay and RMS-matching of update magnitudes. At trillion-parameter scale on 15.5 trillion tokens, Muon on its own produces attention logit blow-ups: large spikes in the softmax inputs of the attention mechanism that cause loss spikes and, in many cases, training crashes.
Moonshot's addition is QK-Clip: a mechanism that rescales the query and key projection weights whenever the largest attention logit exceeds a set threshold, which caps the logits entering the attention softmax and prevents the blow-ups that cause divergence. The combination (Muon for fast convergence, QK-Clip for stability) enabled Moonshot to train the full 1T-parameter model on 15.5T tokens with zero reported instability events.
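The clipping idea can be shown on a toy single-head example. This is a simplified sketch under stated assumptions, not the procedure from the K2 technical report: it uses tiny hand-written matrices, ignores causal masking and MLA's separate components, and the `TAU` threshold and all helper names are hypothetical. The one property it demonstrates is that scaling both projections by the square root of the same factor scales every logit by that factor.

```python
import math

TAU = 100.0  # illustrative logit threshold, an assumption for this sketch


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def max_attention_logit(inputs, w_q, w_k):
    """Largest scaled query-key dot product over all token pairs."""
    queries = [matvec(x, w_q) for x in inputs]
    keys = [matvec(x, w_k) for x in inputs]
    scale = 1.0 / math.sqrt(len(queries[0]))
    return max(scale * sum(qi * ki for qi, ki in zip(q, k))
               for q in queries for k in keys)


def scale_matrix(w, factor):
    return [[factor * v for v in row] for row in w]


def qk_clip(w_q, w_k, observed_max, tau):
    """Shrink W_q and W_k so the observed max logit falls back to tau."""
    if observed_max <= tau:
        return w_q, w_k  # nothing to do when logits are within bounds
    factor = math.sqrt(tau / observed_max)  # split the shrink across Q and K
    return scale_matrix(w_q, factor), scale_matrix(w_k, factor)


# Toy inputs and deliberately oversized projection weights.
inputs = [[3.0, 1.0], [0.5, 2.0]]
w_q = [[4.0, 0.0], [0.0, 4.0]]
w_k = [[5.0, 0.0], [0.0, 5.0]]

before = max_attention_logit(inputs, w_q, w_k)
w_q, w_k = qk_clip(w_q, w_k, before, TAU)
after = max_attention_logit(inputs, w_q, w_k)
print(f"max logit before: {before:.1f}, after: {after:.1f}")
```

Because the rescaling acts on the weights rather than on the logits in the forward pass, the attention computation itself is unchanged; only the parameters that produced the oversized logits are pulled back.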
No other published open-weight MoE at this scale has demonstrated comparable crash-free training on a comparable token budget. This is not a marginal improvement — it is the reason the K2 checkpoint exists at all. Training at this scale without MuonClip would have required either a smaller model, a smaller token budget, or a sequence of costly crash-recovery restarts. The engineering teams that actually used K2.5 as a base (Cursor, and potentially others who have not disclosed) benefited directly from Moonshot having solved this stability problem first.
04 - THE 1/4 SPLIT
Lee Robinson's number: the first public compute-attribution split.
When Cursor was forced to acknowledge the K2.5 base in March 2026, Lee Robinson of Cursor posted a statement on X that contained one sentence the industry has not fully absorbed:
“Only ~1/4 of the compute spent on the final model came from the base, the rest is from our training.”
This is the first publicly disclosed compute-attribution split for a commercial open-weight-base + proprietary-RL coding model. The prior state of the industry was pure speculation: everyone assumed post-training was significant but nobody had a number. Robinson gave a number — approximately 25% pretrain, approximately 75% Cursor-owned post-training — under circumstances that made fabrication costly.
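A short calculation shows what the quoted fraction implies. The sketch treats "~1/4" as exactly 1/4, which is an assumption, and works in normalized units because no absolute compute figure has been disclosed. The `split_compute` helper is hypothetical.

```python
from fractions import Fraction

BASE_SHARE = Fraction(1, 4)  # "~1/4" from the quoted statement, treated as exact


def split_compute(total_units, base_share=BASE_SHARE):
    """Divide total final-model compute into base and post-training parts."""
    base = total_units * base_share
    post = total_units - base
    return base, post


# Normalize the final model's total compute to 1 unit.
base, post = split_compute(Fraction(1))
ratio = post / base
print(f"base: {float(base):.0%}, post-training: {float(post):.0%}")
print(f"post-training compute is {ratio}x the base pretrain compute")
```

Read this way, the disclosure says Cursor spent about three units of its own compute for every unit embodied in the base.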
Three implications follow from that ratio. First, the pretrain cost is borne almost entirely by Moonshot; Cursor's price of $0.50 per million input tokens reflects the marginal cost of inference on a base it did not pay to train. Second, the 3/4 post-training fraction helps explain why the Composer benchmark profile diverges so sharply from the K2 baseline: the jump from 47.3% on SWE-Bench Multilingual (K2 base) to 79.8% (Composer 2.5 post-RL) is the accumulated benefit of Cursor's own training investment. Third, the ratio suggests that the K2.5 base is genuinely the “scaffold”, a structurally sound starting point, rather than a near-finished product Cursor merely fine-tuned.
The split also reframes what “open-weight base” means commercially. By Robinson's figure, the base Moonshot released accounts for about 1/4 of the compute in the finished model, and Cursor added the other 3/4 on top. The Modified MIT license allows this pattern with attribution; the question is whether attribution actually appeared. For the post-training revolution angle, see our post-training RL as the new moat analysis.
05 - MARCH DISPUTE
24 hours of denial, one API token, and a deleted post.
On March 19, 2026, Cursor announced Composer 2 as an “in-house” model. The claim held for less than 24 hours. Developer @fynnso made API calls to the Composer 2 endpoint and captured the model identifier in the response headers: kimi-k2p5-rl-0317-s515-fast. The identifier was unambiguous: Kimi K2.5 fine-tuned with RL, checkpoint date March 17, 2026.
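The reading of that identifier can be made explicit with a small parser. The field meanings below follow the interpretation given in this guide and are assumptions, not a documented naming scheme. The `parse_model_id` helper is hypothetical, and the segments it cannot interpret are left unlabeled rather than guessed.

```python
def parse_model_id(model_id):
    """Split a hyphen-delimited identifier into loosely labeled fields."""
    parts = model_id.split("-")
    fields = {"family": parts[0], "unlabeled": []}
    for part in parts[1:]:
        if part.startswith("k") and "p" in part[1:]:
            # "k2p5" read as version 2.5 (interpretation, not documented)
            major, minor = part[1:].split("p", 1)
            fields["version"] = f"{major}.{minor}"
        elif part == "rl":
            fields["training"] = "rl"
        elif part.isdigit() and len(part) == 4:
            # four digits read as month and day (interpretation)
            fields["date_mmdd"] = (int(part[:2]), int(part[2:]))
        else:
            fields["unlabeled"].append(part)
    return fields


info = parse_model_id("kimi-k2p5-rl-0317-s515-fast")
print(info)
```

Two segments, `s515` and `fast`, stay in the unlabeled list because nothing in the public account explains them.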
Moonshot AI's head of pretraining, Yulun Du, posted a statement on X that became the defining artifact of the incident: "Wait. We tested with composer 2 model API and found out the tokenizer is indeed the same with our Kimi tokenizer! We can almost confirm this is our model post-trained further! We are shocked that @cursor_ai did not respect our license nor did they pay us any fees!" That post has since been deleted, consistent with the conciliatory statement Moonshot released later the same day after Cursor clarified that inference was routed through Fireworks AI under a commercial partnership.
Moonshot's official follow-up acknowledged the partnership and reframed the incident as a communication failure rather than a license violation: "Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support." The timeline and verbatim quotes are sourced from The Open Source Press' definitive reconstruction.
The incident has a structural lesson for the open-weight-base era: when a commercial product is built on an open-weight base, the underlying model's tokenizer, architecture fingerprints, and system-prompt patterns are inspectable by anyone with API access. The “in-house” framing was not sustainable past the first technical audit. Composer 2.5's upfront attribution reflects that lesson.
06 - MOE MAP
K2 vs DeepSeek-V3 vs Qwen3: the activation-rate column nobody publishes.
The table below is our own synthesis from published architecture sources. The “Active / total” column — the activation rate — is the differentiating view: it directly predicts inference cost per quality unit and explains why K2 became Cursor's preferred base. For a broader MoE architecture comparison including GPT-family and Claude models, see our MoE architecture comparison guide.
Kimi K2 (Moonshot AI, Modified MIT)
1T total / 32B active / 384 experts (8+1 routing) / 3.2% activation. The most aggressively sparse MoE in the cohort. Adopted by Cursor Composer 2 and 2.5. MLA attention, SwiGLU, 15.5T training tokens.
DeepSeek-V3 (DeepSeek AI, open-weight)
671B total / 37B active / 256 routed experts (8+1 routing) / 5.5% activation. Fine-grained MoE design closely related to K2’s lineage. Multi-Token Prediction and auxiliary-loss-free load balancing. Strong coding and math.
Qwen3-235B (Alibaba, open-weight)
235B total / 22B active / 128 experts / 9.4% activation. Alibaba’s flagship MoE. Strong multilingual and reasoning. Higher activation rate than K2 or DeepSeek-V3, which may favor quality per token on non-coding workloads.
Mixtral 8x22B (Mistral AI, Apache 2.0)
141B total / 39B active / 8 experts (2-of-8 routing) / 27.7% activation. The reference-generation MoE that demonstrated open MoE viability. Much higher activation rate reflects the coarser 8-expert design.
GPT-OSS-120B (OpenAI, Apache 2.0, Aug 2025)
117B total / 5.1B active / 4.4% activation. OpenAI’s first open-weight release in six years. MoE design with aggressive sparsity. Validates that even closed-lab incumbents now play the open-weight game.
The activation-rate column reveals the generational shift in MoE design. Mixtral 8x22B (27.7%) was the efficiency frontier in early 2024; two years later, K2 activates about one-eighth as large a share of its weights (3.2%). In absolute terms the two run a similar number of parameters per token (32B versus 39B), so K2 carries roughly seven times the total capacity (1T versus 141B) at a comparable per-token compute cost. That compression is not free: it required the fine-grained 384-expert design, the MLA attention mechanism, and the MuonClip training stability to scale up. The downstream effect on pricing is structural, and Cursor's $0.50 per million standard input tokens depends on a base model that activates only 3.2% of its parameters per token.
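The activation-rate column is simple to reproduce from the total and active parameter counts quoted in this section. The sketch below assumes those published figures are accurate and rounds to one decimal place; the `activation_rate` helper and the tuple layout are illustrative choices.

```python
# (model, total parameters in billions, active parameters in billions)
MODELS = [
    ("Kimi K2", 1000.0, 32.0),
    ("DeepSeek-V3", 671.0, 37.0),
    ("Qwen3-235B", 235.0, 22.0),
    ("Mixtral 8x22B", 141.0, 39.0),
    ("GPT-OSS-120B", 117.0, 5.1),
]


def activation_rate(total_b, active_b):
    """Share of total parameters that run for each token."""
    return active_b / total_b


# Sort from most sparse to least sparse.
rows = sorted(MODELS, key=lambda m: activation_rate(m[1], m[2]))
for name, total_b, active_b in rows:
    rate = activation_rate(total_b, active_b)
    print(f"{name:<14} {total_b:>7.1f}B {active_b:>5.1f}B {rate:>6.1%}")
```

Sorting by rate puts Kimi K2 first and Mixtral 8x22B last, while the active-parameter column shows that the per-token workloads of those two models are much closer than their rates suggest.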
The GPT-OSS-120B entry is instructive for a different reason: OpenAI's August 2025 open-weight release confirms that even closed-lab incumbents now view open-weight distribution as strategically necessary. The pattern is one-directional. See the full GPT-OSS announcement for architecture details.
07 - PATTERN
Cursor, Sourcegraph Cody, Tabnine: three production patterns on the same thesis.
The open-weight base + proprietary RL pattern is not unique to Cursor. Three major coding-tool vendors have independently arrived at the same architectural thesis; they differ in where they sit on the cost/control spectrum. Our Cursor Composer 2 deep-dive covers the consumer product in detail; this section maps the pattern across the cohort.
Cursor Composer 2 / 2.5
The compute-heaviest variant. Cursor runs continued pretraining plus high-compute RL on top of the K2.5 base. Inference via Fireworks AI commercial partnership. Proprietary training data and RL reward model. One model serves all users.
Sourcegraph Cody Enterprise
Sourcegraph routes completions through Cody Gateway, defaulting to DeepSeek V2 Lite Base for autocomplete and Claude Sonnet 4.5 for chat. Enterprise customers can configure custom fine-tunes on a chosen open-weight base. Lower compute commitment than Cursor; more routing flexibility.
Tabnine Trainbox
Tabnine clones its Universal model into a customer-specific ‘Trainbox’ instance, then retrains on the customer’s private repositories. Four deployment tiers: SaaS, VPC, On-Prem, Air-Gapped. Maximum data isolation; highest per-customer compute cost.
The three patterns represent distinct equilibria on the same cost/control tradeoff. Cursor maximizes performance by investing 3/4 of total compute in post-training but sacrifices per-customer customization — every Composer user gets the same model. Sourcegraph trades some performance ceiling for routing flexibility and enterprise configurability. Tabnine trades performance and compute efficiency for maximum data isolation and customer-specific adaptation, which is the only viable path for air-gapped enterprise deployments.
No vendor has fully disclosed the base model behind every configuration. Sourcegraph's Cody model configuration docs and Tabnine's fine-tuned AI models documentation are the most transparent primary sources available for their respective approaches.
08 - LICENSE
The 100M-MAU or $20M-monthly-revenue clause, and Cursor's compliance question.
Kimi K2.5 is released under a Modified MIT license, not standard MIT. The critical addition is an attribution clause: any commercial product or service that exceeds either 100 million monthly active users OR $20 million in monthly revenue must display “Kimi K2.5” in the product user interface.
Cursor's reported $1 billion annualized revenue in November 2025 (per secondary reporting in The Open Source Press, citing industry sources) implies monthly revenue well above the $20M threshold. The attribution clause therefore almost certainly applies. Whether “Kimi K2.5” currently appears in the Cursor UI — in model selectors, about pages, or settings — remains publicly unconfirmed as of this writing.
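The threshold arithmetic can be written out directly. This is a sketch of the clause as summarized in this guide, not legal analysis: the thresholds are taken from the description above, the ARR figure is the reported one, and the monthly active user count is a placeholder because no MAU figure is given here. The function name is hypothetical.

```python
MAU_THRESHOLD = 100_000_000              # monthly active users, per the clause
MONTHLY_REVENUE_THRESHOLD_USD = 20_000_000  # monthly revenue, per the clause


def attribution_required(monthly_active_users, monthly_revenue_usd):
    """True when either threshold is exceeded (the clause uses OR)."""
    return (monthly_active_users > MAU_THRESHOLD
            or monthly_revenue_usd > MONTHLY_REVENUE_THRESHOLD_USD)


reported_arr_usd = 1_000_000_000          # reported, not independently verified
implied_monthly_usd = reported_arr_usd / 12

# MAU is unknown here; 0 is a placeholder since either condition suffices.
required = attribution_required(0, implied_monthly_usd)
multiple = implied_monthly_usd / MONTHLY_REVENUE_THRESHOLD_USD
print(f"implied monthly revenue: ${implied_monthly_usd:,.0f}")
print(f"multiple of the revenue threshold: {multiple:.1f}x")
print(f"attribution required under this reading: {required}")
```

Even with the user count set to zero, the implied monthly revenue alone is several times the trigger, which is why the guide treats the clause as almost certainly applicable.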
Lee Robinson's March 20 statement included the line “we are following the license through our inference partner terms,” which implies that Cursor's Fireworks AI inference agreement may include compliance provisions for the attribution clause. That framing is plausible but not independently verified — the inference-partner channel for satisfying an end-product UI attribution requirement is architecturally unusual.
The broader implication is that the Modified MIT license creates a category of tail risk for open-weight base adoption that pure Apache 2.0 licenses (GPT-OSS, Mistral's models) do not. Any engineering team evaluating K2.5 as a base for a commercial product should verify the current license text against the HuggingFace model card before shipping to production, and consult legal counsel if the revenue or MAU thresholds are within reach. For organizations navigating open-weight licensing across multiple models, our AI transformation engagement includes license due diligence as part of the model selection phase.
“The open-weight cost floor is not free. It comes with a Modified MIT attribution clause that triggers above $20M monthly revenue, and the cleanest commercial proof of the open-base pattern is also its highest-profile compliance question.”
Digital Applied synthesis, May 18, 2026
09 - EXIT
The Cursor + SpaceXAI 10× compute commitment: the planned exit from K2.5 dependence.
Buried in the Composer 2.5 launch blog was the strategic implication the headline numbers obscured. After disclosing Composer 2.5's K2.5 base, Cursor announced a separate effort: “Together with SpaceXAI, we’re training a significantly larger model from scratch, using 10× more total compute. With Colossus 2's million H100-equivalents and our combined data and training techniques, we expect this to be a major leap in model capability.”
The framing is explicit: a from-scratch pretrain on Colossus 2, Cursor's own training data and techniques, 10× the total compute relative to the K2.5 base run. If the K2.5 base represents roughly $X in pretrain compute, the from-scratch run represents roughly $10X — an absolute figure Cursor has not disclosed. The practical outcome, if the training run succeeds, is a Composer 3 that is not K2.5-dependent and does not carry the Modified MIT attribution clause.
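The quoted sentence does not spell out the baseline for "10×", so it is worth seeing how much the answer moves with that choice. The sketch below normalizes the K2.5 base pretrain to one unit and assumes the 1/4 base share also applies to the current Composer model. Reading A is the one this guide uses; reading B is a hypothetical alternative included only for comparison.

```python
from fractions import Fraction

BASE_SHARE = Fraction(1, 4)  # base share of final-model compute, quoted "~1/4"
SCALE_UP = 10                # the quoted "10x more total compute"

base_units = Fraction(1)                         # K2.5 base pretrain = 1 unit
composer_total_units = base_units / BASE_SHARE   # 4 units if the share holds

# Reading A (used in this guide): 10x the K2.5 base run.
reading_a = SCALE_UP * base_units
# Reading B (hypothetical alternative): 10x Composer's total compute.
reading_b = SCALE_UP * composer_total_units

print(f"Composer total today: {composer_total_units} units")
print(f"reading A: {reading_a} units ({reading_a / composer_total_units}x Composer total)")
print(f"reading B: {reading_b} units ({reading_b / base_units}x the base run)")
```

The two readings differ by a factor of four, so any dollar estimate for the from-scratch run inherits that uncertainty until Cursor states the baseline.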
The economics of this exit are interesting. K2.5 lowers Cursor's inference cost by amortizing Moonshot's pretrain across the ecosystem. A from-scratch Cursor pretrain moves that cost entirely onto Cursor's balance sheet — and raises the question of whether Cursor's own pretrain will match the quality of K2.5's MuonClip-enabled 15.5T-token run. MuonClip is not a proprietary secret (it is described in the K2 technical report), but applying it at scale requires engineering investment that Cursor may or may not have internalized.
The from-scratch commitment does not signal that K2.5 is inadequate; it signals that Cursor views base-model independence as a strategic necessity at their scale. The K2.5 open-weight cost-floor argument holds for any vendor that has not crossed the threshold to justify a proprietary pretrain. For the architecture comparison that makes this decision legible, see our frontier MoE architecture comparison.
Figure: Composer compute attribution (K2.5 base vs Cursor post-training vs planned from-scratch run).
Source: Lee Robinson X-post (Mar 20, 2026) via The Open Source Press; Cursor blog Composer 2.5 (May 18, 2026).
The 1/4-3/4 split is the cleanest number the open-weight era has produced.
Kimi K2.5 and Cursor's RL stack form the most-documented example of a two-layer commercial coding model. The Robinson disclosure — approximately one quarter of the final model's compute came from the Moonshot pretrain, three quarters came from Cursor's post-training — is the first time a commercial open-weight-base consumer has quantified that ratio publicly. Expect more vendors to publish similar figures as the pattern matures; the transparency pressure from the March attribution incident has reset the norms.
The Modified MIT attribution clause is the under-discussed compliance question in this story. Cursor reportedly crossed $1B ARR in late 2025 — well above the $20M monthly revenue trigger. Whether “Kimi K2.5” appears in the Cursor UI as required is still publicly unconfirmed. This is the open-weight licensing tail-risk of the pattern: a model can be genuinely open-weight with a commercially permissive license and still contain attribution requirements that create compliance obligations for high-revenue consumers. Teams evaluating K2.5 as a base should treat the Modified MIT license as a material factor alongside the architecture and benchmark numbers.
The Cursor + SpaceXAI from-scratch commitment — 10× total compute on Colossus 2's approximately one million H100-equivalents — is the planned exit from K2.5 dependence. K2.5 wins the cost-floor competition today; the question for the next 18 months is whether closed-weight bases, retrained with comparable post-training discipline, can close the gap. The MuonClip training methodology is now documented in the K2 technical report. The techniques are available. The compute is the variable.