Cursor shipped Composer 2.5 earlier today (May 18, 2026) on the same Moonshot Kimi K2.5 open-weight checkpoint that powered Composer 2, adding a second-generation reinforcement learning layer. Cursor's reported benchmarks put it within 0.7 points of Claude Opus 4.7 on SWE-Bench Multilingual at $0.50/$2.50 per million tokens, one-tenth the per-token list price of Opus 4.7's $5/$25 standard tier.
The cost story is straightforward on paper. The governance story is less so: the headline CursorBench v3.1 number (63.2%) is built, run, and scored by Cursor itself — the same entity authoring the model. Disciplined buyers will weight the SWE-Bench Multilingual and Terminal-Bench 2.0 numbers more heavily, since those benchmarks are not Cursor-controlled. And on those third-party surfaces, the Composer 2.5 result is genuinely compelling: 79.8% on Multilingual (0.7 points behind Opus 4.7's 80.5%) and 69.3% on Terminal-Bench (essentially tied with Opus 4.7's 69.4%), at a fraction of the price.
This guide covers what shipped today, the open-weight-base plus proprietary-RL architecture that produced it, the controlled A/B experiment embedded in the Composer 2 to 2.5 transition, the lock-in tradeoff of a Cursor-IDE-only distribution model, and what the SpaceX partnership actually signals about the roadmap — which is Composer 3, not this release.
- 01. Same base, 60 days, materially better benchmarks. Composer 2.5 uses the identical Kimi K2.5 checkpoint as Composer 2. The +11.0/+6.1/+7.6 percentage-point gains on CursorBench, SWE-Bench Multilingual, and Terminal-Bench are the product of 25x more synthetic training tasks and a targeted-textual-feedback RL approach, not a new base model.
- 02. Standard pricing held flat; Fast tier doubled. Standard remains $0.50 in / $2.50 out per Mtok, identical to Composer 2. Fast tier jumped from $1.50/$7.50 to $3.00/$15.00, a 100% increase. Cursor is pricing routing economics into the tiers as Fast becomes the default for interactive sessions and Standard handles background batch work.
- 03. CursorBench is vendor-controlled; treat accordingly. Cursor builds and scores CursorBench. The 63.2% score for Composer 2.5 outperforms Opus 4.7 at default settings (61.6%), but Opus 4.7 Adaptive reaches 64.8%. Weight SWE-Bench Multilingual and Terminal-Bench 2.0 numbers more heavily, since those benchmarks are not controlled by Cursor.
- 04. No public API: Cursor IDE only at launch. Composer 2.5 is not exposed via any third-party API surface. Not on OpenRouter, not on Bedrock, not on Vertex. It is available exclusively inside the Cursor IDE. This is a deliberate distribution choice that inverts the standard model-as-commodity pattern.
- 05. SpaceX means Composer 3, not Composer 2.5. Cursor's own announcement describes work in progress on a future model, not the one shipping today: "Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute." Composer 2.5 runs on Cursor's existing RL pipeline. Colossus 2's ~1M H100-equivalents are the infrastructure for the next-generation model.
01 - LAUNCH
What shipped today: pricing, base model, and the 1/10 framing.
Cursor's May 18 release announcement, published at cursor.com/blog/composer-2-5, describes Composer 2.5 as "frontier-level at coding" at $0.50/M input and $2.50/M output tokens, calling it "a new, optimal combination of intelligence and cost." Two tiers ship simultaneously: Standard at $0.50/$2.50 and Fast at $3.00/$15.00. A launch-week promotion doubles usage allocations for Cursor subscribers for the first seven days.
The 1/10 cost framing is precise on per-token list price. Opus 4.7 sits at $5.00 in / $25.00 out per Mtok (1M context, flat pricing per Anthropic's April 16 announcement). Composer 2.5 Standard is exactly 10x cheaper on both input and output. GPT-5.5 sits at $5.00 in / $30.00 out per Mtok below 272K tokens — with a 2x input surcharge above that threshold — making Composer 2.5 Standard 10x cheaper on input and 12x cheaper on output against the GPT-5.5 list rate.
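The ratios above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `PRICES` table and the `price_ratio` helper are names invented here, and the figures are the list prices quoted in this article, not values fetched from any vendor API.

```python
# List prices in USD per million tokens (input, output), as quoted in this article.
PRICES = {
    "composer-2.5-standard": (0.50, 2.50),
    "opus-4.7": (5.00, 25.00),
    "gpt-5.5-under-272k": (5.00, 30.00),  # below the 272K-token threshold
}


def price_ratio(expensive: str, cheap: str) -> tuple:
    """Return (input_ratio, output_ratio) of list prices between two models."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


opus_ratio = price_ratio("opus-4.7", "composer-2.5-standard")
gpt_ratio = price_ratio("gpt-5.5-under-272k", "composer-2.5-standard")
print(opus_ratio)  # (10.0, 10.0)
print(gpt_ratio)   # (10.0, 12.0)
```

The comparison covers list price per token only. It does not account for how many tokens each model consumes to finish the same task.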
The base model is disclosed in the announcement: Moonshot AI's Kimi K2.5, a 1.04-trillion-parameter Mixture-of-Experts model (32B activated per token) released January 29, 2026 under a Modified MIT license. This is the same open-weight checkpoint that powered Composer 2 — a detail that becomes important when interpreting the benchmark deltas in Section 05.
02 - ARCHITECTURE
Open-weight base plus proprietary reinforcement learning.
The Composer line embodies what is becoming the dominant frontier-coding architecture: take a permissively licensed open-weight base model, in this case Kimi K2.5, and layer proprietary post-training on top. Cursor has disclosed more about this pipeline with each release. With Composer 1.5, the "20x RL scale-up" disclosure established that the bulk of differentiation sits in post-training, not pre-training. DataCamp's analysis of the Composer 2.5 announcement estimates that roughly 85% of total compute for the final model comes from Cursor's post-training work. That ratio is striking if accurate, but the specific figure is DataCamp's interpretation rather than a direct Cursor disclosure.
What Cursor does disclose directly for Composer 2.5: the model is trained with "targeted textual feedback" — short text hints inserted at exact decision points where the model went off-trajectory during RL rollouts, rather than scoring only the final outcome. This addresses a fundamental challenge Cursor identifies in their announcement: "credit assignment during RL is becoming an increasingly difficult challenge as rollouts can span hundreds of thousands of tokens." Scoring a final diff is easy; knowing which of 200 intermediate steps caused the failure is not.
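To make the credit-assignment contrast concrete, here is a toy sketch. It is not Cursor's training code: the `Step` type, the `on_track` flag, and both functions are hypothetical, and the announcement does not disclose how a real pipeline decides where a rollout went off-trajectory. The sketch only contrasts a single scalar for the whole rollout with a text hint attached at the first step that diverged.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    action: str
    on_track: bool               # hypothetical judge label, assumed to be given
    hint: Optional[str] = None   # textual feedback attached to this step


def outcome_only_reward(passed: bool) -> float:
    """Outcome-only scoring: one scalar for the entire rollout."""
    return 1.0 if passed else 0.0


def attach_targeted_hint(steps: List[Step], hint: str) -> Optional[int]:
    """Attach a text hint at the first off-track step and return its index."""
    for index, step in enumerate(steps):
        if not step.on_track:
            step.hint = hint
            return index
    return None  # no divergence found, nothing to annotate


rollout = [
    Step("read failing test", on_track=True),
    Step("edit unrelated module", on_track=False),
    Step("rerun tests", on_track=True),
]
reward = outcome_only_reward(passed=False)
hint_index = attach_targeted_hint(rollout, "The failure is in the parser, not this module.")
print(reward, hint_index)
```

The scalar says only that the rollout failed. The hint index says where, which is the information outcome-only scoring lacks on long rollouts.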
The training infrastructure uses a Sharded Muon optimizer with dual mesh HSDP, achieving a 0.2-second step time on the ~1T-parameter model. Muon is the same optimizer family Moonshot used to pretrain the original Kimi K2. Cursor's adoption of it for post-training at this scale is a meaningful data point for the "post-training is the new moat" thesis.
03 - BENCHMARKS
CursorBench, SWE-Bench Multilingual, and Terminal-Bench numbers.
Three benchmark surfaces are reported in the Composer 2.5 announcement and cross-referenced in DataCamp's independent coverage. Section 08 addresses the governance flags in detail. The chart below shows Composer 2.5 alongside Opus 4.7 (at its best reported settings) and GPT-5.5 (at its best reported settings) on each surface. All percentages are as reported by the relevant vendor; independent third-party reproduction had not been completed at the May 18 publish date, per The New Stack's coverage.
The critical benchmark-variant discipline: Opus 4.7's reported 87.6% on SWE-Bench Verified is a different evaluation surface from the 80.5% on SWE-Bench Multilingual cited here. Cross-variant comparison is methodological malpractice. Composer 2.5's 79.8% is measured on Multilingual — not Verified. For the full taxonomy, see our SWE-Bench vs Terminal-Bench methodology guide.
Chart: Benchmark comparison, Composer 2.5 vs Opus 4.7 vs GPT-5.5
Sources: cursor.com/blog/composer-2-5; datacamp.com/blog/composer-2-5; anthropic.com/news/claude-opus-4-7
The pattern across three surfaces is consistent: Composer 2.5 sits within 0.7 points of Opus 4.7 on SWE-Bench Multilingual, ties Opus 4.7 within 0.1 points on Terminal-Bench 2.0, and leads at default CursorBench settings (though Opus 4.7 Adaptive edges ahead at 64.8% vs 63.2%). GPT-5.5 is the clear leader on Terminal-Bench at 82.7%, a 13.4-point gap that matters for shell-automation and infrastructure-as-code workloads.
The actionable interpretation is not "Composer 2.5 ties Opus 4.7 in absolute capability." It is: at 1/10 the per-token cost, Composer 2.5 can close out roughly the same agentic coding tasks that Opus 4.7 can, within a margin that many production workloads will find acceptable. That is the Pareto-frontier redrawn, not the capability frontier shifted.
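The point gaps quoted in this section follow directly from the reported scores. The snippet below is a small illustrative check. The `SCORES` table and `gap` helper are names made up for this sketch, and only scores that appear in this article are included, so GPT-5.5 is listed for Terminal-Bench alone.

```python
# Reported scores (percent), exactly as cited in this article.
SCORES = {
    "SWE-Bench Multilingual": {"Composer 2.5": 79.8, "Opus 4.7": 80.5},
    "Terminal-Bench 2.0": {"Composer 2.5": 69.3, "Opus 4.7": 69.4, "GPT-5.5": 82.7},
    "CursorBench v3.1": {
        "Composer 2.5": 63.2,
        "Opus 4.7 (default)": 61.6,
        "Opus 4.7 (Adaptive)": 64.8,
    },
}


def gap(benchmark: str, model: str, baseline: str = "Composer 2.5") -> float:
    """Percentage-point lead of `model` over `baseline` (negative means behind)."""
    row = SCORES[benchmark]
    return round(row[model] - row[baseline], 1)


multilingual_gap = gap("SWE-Bench Multilingual", "Opus 4.7")
terminal_gap_opus = gap("Terminal-Bench 2.0", "Opus 4.7")
terminal_gap_gpt = gap("Terminal-Bench 2.0", "GPT-5.5")
cursorbench_default = gap("CursorBench v3.1", "Opus 4.7 (default)")
cursorbench_adaptive = gap("CursorBench v3.1", "Opus 4.7 (Adaptive)")
print(multilingual_gap, terminal_gap_opus, terminal_gap_gpt)
print(cursorbench_default, cursorbench_adaptive)
```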
04 - PRICING
Standard vs Fast: the routing-economics shift.
The Fast tier's 100% price increase from Composer 2 to 2.5, from $1.50/$7.50 to $3.00/$15.00, is the least-discussed detail of the launch, and arguably the most important for buyers budgeting at scale. Standard pricing is identical across both generations; the Fast tier doubled in price while still offering the same intelligence as Standard at lower latency.
Independent reviewers at PrimeAICenter describe the routing logic this way: "Fast is for interactive sessions, while Standard is for background work." Cursor is pricing capacity constraints into the Fast tier — interactive demand competes for lower-latency inference infrastructure that costs more to provision. Standard serves background agents and batch pipelines where latency is not the binding constraint.
Standard tier: $0.50 in / $2.50 out per Mtok
Best for batch pipelines, background agents, PR review queues, nightly test-generation runs, and any task where latency is not the binding constraint. Same intelligence as Fast at 6x lower cost.
Fast tier: $3.00 in / $15.00 out per Mtok
Best for live pair-programming, interactive Composer sessions, and latency-sensitive code completions. Same intelligence as Standard at 6x higher cost. The premium buys lower first-token latency, not more accuracy.
Launch week: 2x usage allocations
Cursor doubles usage allocations for all subscribers during the first week of the Composer 2.5 launch. This effectively halves cost on existing plans, but only through the promotional window.
Opus 4.7 reference: $5.00 in / $25.00 out per Mtok
Anthropic API list rate for Claude Opus 4.7 with 1M context flat pricing. Composer 2.5 Standard is 10x cheaper on input and output. Opus 4.7 is available via API, Bedrock, and Vertex, while Composer 2.5 is Cursor IDE only.
The practical guidance for most teams: default new background-agent workloads to Standard tier. Cursor's own framing — "Standard for background, Fast for interactive" — is reasonable guidance. For teams currently paying Opus 4.7 API rates for background batch jobs that run inside Cursor IDE, the routing shift to Composer 2.5 Standard can materially reduce per-task cost with minimal capability tradeoff.
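As a rough budgeting aid, the sketch below prices one workload at each tier. The monthly volume (200 Mtok in, 40 Mtok out) is a made-up assumption for illustration, and the `TIERS` table and `monthly_cost` helper are hypothetical names. The prices are the list rates quoted in this article.

```python
# USD per million tokens (input, output), list prices quoted in this article.
TIERS = {
    "Composer 2.5 Standard": (0.50, 2.50),
    "Composer 2.5 Fast": (3.00, 15.00),
    "Opus 4.7": (5.00, 25.00),
}

# Hypothetical monthly volume for a background-agent workload (an assumption).
INPUT_MTOK = 200
OUTPUT_MTOK = 40


def monthly_cost(tier: str) -> float:
    """List-price cost in USD for the assumed monthly token volume."""
    price_in, price_out = TIERS[tier]
    return INPUT_MTOK * price_in + OUTPUT_MTOK * price_out


costs = {tier: monthly_cost(tier) for tier in TIERS}

# Fast tier input price moved from $1.50 (Composer 2) to $3.00 (Composer 2.5).
fast_increase_pct = (3.00 - 1.50) / 1.50 * 100

for tier, usd in costs.items():
    print(f"{tier}: ${usd:,.2f}")
print(f"Fast tier increase: {fast_increase_pct:.0f}%")
```

Under these assumed volumes the same workload costs 6x more on Fast and 10x more on Opus 4.7 than on Standard, mirroring the per-token ratios.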
05 - DELTA
Same base, 60 days, +11/+6/+7 points.
The most significant analytical signal in today's release is not the absolute benchmark position. It is the controlled comparison between Composer 2 (launched March 19, 2026) and Composer 2.5 (launched today, 60 days later). Both use the identical Kimi K2.5 base checkpoint. The only variables are Cursor's post-training methodology and data scale. This is the closest thing to a controlled A/B experiment that has appeared in the public record for the Composer lineage (see our Composer 2 deep dive, which covers 2.5's direct predecessor).
Cursor's own disclosure: "Composer 2.5 is trained with 25x more synthetic tasks than Composer 2." The synthetic-task methodology includes "feature deletion approaches grounded in real codebases" — deleting real features and training the model to reconstruct them, which grounds the synthetic distribution in production patterns rather than toy examples.
Composer 2 → Composer 2.5
Composer 2 launched March 19, 2026. Composer 2.5 launched May 18, 2026. Two months of post-training iteration on an identical open-weight base — the cleanest public evidence of post-training as a frontier-shifting axis.
25x more tasks than Composer 2
Cursor's direct disclosure: "Composer 2.5 is trained with 25x more synthetic tasks than Composer 2." Tasks include feature deletion approaches grounded in real codebases. Scale of synthetic data is the dominant lever per DataCamp's analysis (~85% of total compute attributed to Cursor's post-training).
52.2% → 63.2% on CursorBench v3.1
The largest single-version delta in the Composer line's public history on any benchmark surface. ⚠ Vendor-controlled benchmark — Cursor builds and scores CursorBench. The 11-point gain is directionally significant but not independently verified.
73.7% → 79.8% on SWE-Bench Multilingual
The SWE-Bench Multilingual gain carries more weight than CursorBench because this benchmark is not controlled by Cursor. +6.1 percentage points on a third-party surface in 60 days, with a constant base model, is the strongest evidence available for the post-training-as-scaling-axis thesis.
To complete the controlled comparison: Terminal-Bench 2.0 improved by 7.6 points (61.7% to 69.3%) over the same 60-day window, with the same base model. All three benchmark surfaces moved in the same direction by 6-11 points. Holding the base constant is what makes this analytically useful: it isolates post-training methodology as the variable and rules out base-model improvement as an explanation.
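The deltas and the 60-day window can be reproduced from the figures cited in this section. This is a minimal check rather than an analysis tool, and the dictionary names are invented for the sketch.

```python
from datetime import date

# Reported scores (percent) for each release, as cited in this section.
COMPOSER_2 = {
    "CursorBench v3.1": 52.2,
    "SWE-Bench Multilingual": 73.7,
    "Terminal-Bench 2.0": 61.7,
}
COMPOSER_2_5 = {
    "CursorBench v3.1": 63.2,
    "SWE-Bench Multilingual": 79.8,
    "Terminal-Bench 2.0": 69.3,
}

# Percentage-point gain per benchmark with the base model held constant.
deltas = {
    name: round(COMPOSER_2_5[name] - COMPOSER_2[name], 1)
    for name in COMPOSER_2
}

# Days between the Composer 2 and Composer 2.5 launch dates.
days_between = (date(2026, 5, 18) - date(2026, 3, 19)).days

print(deltas)
print(days_between)
```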
Our Claude Opus 4.7 complete guide provides the reference point for what Composer 2.5 is now competing against at the frontier. The 60-day cadence also suggests Cursor has institutionalized a rapid post-training iteration cycle that most closed-model labs do not publicly disclose.
"The same base model, 60 days of post-training iteration, and +6 to +11 percentage points across three benchmark surfaces: this is the clearest public demonstration that post-training compute is a genuine scaling axis."
Digital Applied synthesis, May 18, 2026
06 - LOCK-IN
No public API: Cursor IDE only.
Composer 2.5 is available exclusively inside the Cursor IDE. There is no public model API, no OpenRouter listing, no Bedrock integration, no Vertex deployment. This is a deliberate architectural and commercial choice, not a temporary gap.
The lock-in logic works in both directions. For buyers, it means the 1/10 cost ratio is only accessible if your development workflow runs inside Cursor IDE. A team using Claude Code, VS Code with the Copilot extension, or a custom IDE-agnostic pipeline cannot route to Composer 2.5 Standard at $0.50/$2.50 — they are paying Opus 4.7 API rates at minimum. For teams already inside Cursor, the cost math is straightforward and the switching cost is zero.
For infrastructure teams building agent scaffolding that runs outside the IDE — autonomous repo agents, CI/CD-triggered code review, nightly test-generation pipelines that don't run in an interactive IDE session — Composer 2.5 is not currently an option. This is the most significant capability constraint in the product. Our AI transformation engagement regularly evaluates exactly this tradeoff when teams are selecting coding model infrastructure.
Composer 2.5: Cursor IDE, exclusive distribution
Available only inside the Cursor IDE. Standard ($0.50/$2.50) and Fast ($3.00/$15.00) tiers. No public API, no Bedrock, no Vertex, no OpenRouter. The cheapest frontier coding model on the market, but only if your workflow lives in Cursor.
Claude Opus 4.7: full API + IDE integrations
Available via Anthropic API, Amazon Bedrock, Google Vertex AI, and through Cursor IDE (as a BYOK option). $5.00 in / $25.00 out per Mtok with 1M context flat pricing. Maximum distribution flexibility: works in any IDE, any agent scaffold, any custom pipeline.
GPT-5.5: OpenAI API + Foundry Local
Available via OpenAI API and Microsoft Azure Foundry. $5.00 in / $30.00 out per Mtok below 272K tokens (2x input surcharge above). Also available through GitHub Copilot integration. Terminal-Bench leader at 82.7%. On-premises deployment now available via Dell AI Factory partnership (May 18, 2026).
The choice matrix above simplifies distribution to a decision between three models. In practice, most enterprise teams will run all three — routing by task type and cost profile. The key constraint is that Composer 2.5 can only participate in IDE-bound workflows at launch. Any agent pipeline running outside the IDE must budget for Opus 4.7 or GPT-5.5 API rates.
07 - ROADMAP
The SpaceX deal is about Composer 3, not Composer 2.5.
Cursor's Composer 2.5 announcement includes a section on the SpaceX partnership that has been widely misread in press coverage. The announcement reads: "Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute. With Colossus 2's million H100-equivalents and our combined data and training techniques, we expect this to be a major leap in model capability."
The verb tense is unambiguous: "we're training" is present continuous, describing work still in progress rather than a model that has shipped. Composer 2.5 was trained on Cursor's existing RL pipeline. Colossus 2, along with the ~$60B SpaceX acquisition option (reportedly including a $10B breakup fee, per TechCrunch's April 22 reporting), backs a model trained from scratch, which is what Composer 3 will be, not Composer 2.5.
The scale-up is also qualitatively different. Composer 2.5 is a post-training improvement on an existing open-weight checkpoint. Composer 3 is described as "a significantly larger model" trained "from scratch" with "10x more total compute." That implies a new base model at a larger parameter scale, trained with Cursor's post-training methodology baked in from the start, on infrastructure that dwarfs what any single commercial AI lab has deployed independently. The timeline is undisclosed — Cursor says "we're training" with no committed ship date.
08 - GOVERNANCE
CursorBench needs the same skepticism we applied to SWE-Bench.
CursorBench v3.1 is built, maintained, and scored by Cursor — the same entity that authors Composer 2.5. This is the most significant governance concern in interpreting today's benchmark claims. Every row in the CursorBench leaderboard — including the Opus 4.7 and GPT-5.5 comparator rows — is evaluated through Cursor's harness, on tasks Cursor selected, with a scoring methodology Cursor controls.
DataCamp's coverage discloses a related asymmetry: "Terminal-Bench and SWE-Bench Multilingual scores for competitors are self-reported from Anthropic and OpenAI respectively." That means the three-surface comparison in Section 03 is a hybrid: Cursor-measured CursorBench for all three models, plus Anthropic-self-reported Multilingual and Anthropic-self-reported Terminal-Bench for Opus 4.7, plus OpenAI-self-reported numbers for GPT-5.5. No single surface has been independently reproduced under a unified scaffold across all three models as of the May 18 publish date.
This is not unique to Cursor or Composer 2.5 — it mirrors the governance problems in the broader SWE-Bench ecosystem we analyzed in detail. The appropriate response is not to dismiss the numbers, but to weight them proportionally: third-party benchmarks (Multilingual, Terminal-Bench) carry more weight than vendor-controlled benchmarks (CursorBench) when making procurement decisions.
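The hybrid sourcing described above can be laid out as a small provenance table. The sketch is an illustration of the bookkeeping, not an official data source: the `REPORTED_BY` mapping encodes this article's reading of who reported each score, and the helper name is hypothetical.

```python
from collections import defaultdict

VENDORS = {"Cursor", "Anthropic", "OpenAI"}

# (benchmark, model) -> party that reported the score, per the sourcing above.
REPORTED_BY = {
    ("CursorBench v3.1", "Composer 2.5"): "Cursor",
    ("CursorBench v3.1", "Opus 4.7"): "Cursor",
    ("CursorBench v3.1", "GPT-5.5"): "Cursor",
    ("SWE-Bench Multilingual", "Composer 2.5"): "Cursor",
    ("SWE-Bench Multilingual", "Opus 4.7"): "Anthropic",
    ("Terminal-Bench 2.0", "Composer 2.5"): "Cursor",
    ("Terminal-Bench 2.0", "Opus 4.7"): "Anthropic",
    ("Terminal-Bench 2.0", "GPT-5.5"): "OpenAI",
}


def reporters_by_benchmark(table: dict) -> dict:
    """Group the set of reporting parties under each benchmark."""
    grouped = defaultdict(set)
    for (benchmark, _model), reporter in table.items():
        grouped[benchmark].add(reporter)
    return dict(grouped)


reporters = reporters_by_benchmark(REPORTED_BY)

# Surfaces where a single party produced every row (one harness, one scorer).
unified = {name for name, who in reporters.items() if len(who) == 1}
# Surfaces where no row comes from a model vendor (independent reproduction).
independent = {name for name, who in reporters.items() if who.isdisjoint(VENDORS)}

print(unified)
print(independent)
```

The only unified surface is the vendor-controlled one, and the independent set is empty, which is the asymmetry this section describes.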
09 - ROUTING
Task-routing against Opus 4.7 and GPT-5.5.
Given the benchmark profile and distribution constraints, the practical routing question is: which model for which task, at what cost? The matrix below captures the decision for the three most common agentic coding task classes. All cost figures are based on published list prices as of May 18, 2026.
Composer 2.5 Standard
Background agents, PR review queues, test generation, multi-file refactors where latency is not the binding constraint, and any Cursor-IDE workflow where cost matters. 10x cheaper than Opus 4.7 for parity-level agentic coding quality on Multilingual and Terminal-Bench.
Claude Opus 4.7
Long-context reasoning, non-IDE agent pipelines, complex backend coherence over very long tasks, Bedrock/Vertex deployment, and any workflow that runs outside Cursor IDE. Marginally ahead on Multilingual (80.5% vs 79.8%) and Terminal-Bench (69.4% vs 69.3%) at 10x the cost.
GPT-5.5 standard
Terminal and shell automation, infrastructure-as-code tasks, CLI scaffolding. GPT-5.5 leads Terminal-Bench 2.0 at 82.7% vs 69.3% for Composer 2.5 — a 13.4-point gap that matters for shell-heavy workloads. Now also available on-premises via Dell AI Factory (May 18, 2026).
The routing recommendation across all three is straightforward: if the task runs inside Cursor IDE and tolerates ~0.7pt lower accuracy on Multilingual, Composer 2.5 Standard is the economically dominant choice. If it runs outside the IDE, or involves very-long-context coherence that community reviewers have flagged as a Composer 2.5 weakness, default to Opus 4.7. If it requires terminal/shell mastery, default to GPT-5.5.
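The routing rule can be written down as a small decision function. This is one possible encoding under stated assumptions, not a prescribed policy: the `Task` fields and the `route` function are hypothetical names, and the order of the checks (shell-heavy first, then distribution and context limits, then tier) is a design choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    runs_in_cursor_ide: bool
    shell_heavy: bool = False        # terminal, CLI, or infrastructure-as-code work
    very_long_context: bool = False  # long-horizon coherence matters
    interactive: bool = False        # a human is waiting on the response


def route(task: Task) -> str:
    """Pick a model for a task using the guidance in this section."""
    if task.shell_heavy:
        return "GPT-5.5"  # Terminal-Bench 2.0 leader per the reported scores
    if not task.runs_in_cursor_ide or task.very_long_context:
        return "Claude Opus 4.7"  # Composer 2.5 has no public API at launch
    # Inside Cursor IDE: Fast for interactive sessions, Standard for background work.
    return "Composer 2.5 Fast" if task.interactive else "Composer 2.5 Standard"


print(route(Task(runs_in_cursor_ide=True)))
print(route(Task(runs_in_cursor_ide=True, interactive=True)))
print(route(Task(runs_in_cursor_ide=False)))
print(route(Task(runs_in_cursor_ide=True, shell_heavy=True)))
```

Checking `shell_heavy` first means a shell-heavy task goes to GPT-5.5 even when it runs inside Cursor. Teams that weight cost above Terminal-Bench performance may prefer to reorder the checks.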
10 - OUTLOOK
The open-weight-base plus proprietary-RL pattern as the new commercial template.
Composer 2.5's architecture is not novel in concept. Differentiating through post-training dates at least to OpenAI's o1 in September 2024, although o1 is a closed model rather than an open-weight build, and Cursor's own Composer 1.5 followed the same post-training-first logic. What Composer 2.5 provides that earlier releases did not is the cleanest controlled evidence: same base, 60 days, +6 to +11 points. Most labs cannot run that A/B because they change the base model between releases. Cursor held the base constant.
The commercial template it establishes is: take a permissively-licensed open-weight MoE (Kimi K2.5 in this case, Modified MIT license), invest the dominant share of compute in task-specific RL post-training, and distribute the result exclusively through a proprietary IDE or toolchain. This inverts the standard model-as-commodity assumption. The base is the commodity; the post-training is the moat; the distribution channel is the lock-in mechanism.
The pattern predicts Q3 and Q4 2026 will see multiple vendors attempt the same three-layer play. The variable that matters is not which open-weight base they choose — those are increasingly interchangeable as Modified MIT and Apache 2.0 licenses proliferate — but how sophisticated their post-training methodology is and how defensible their distribution channel proves to be. Cursor's IDE moat has held so far. Whether it holds as Claude Code, Codex CLI, and Aider compete for the same workflow is the strategic question that today's release does not yet answer. For the broader context on where Cursor sits in that competitive landscape, see our Composer 2 deep dive — 2.5's direct predecessor.
Post-training compute is the axis — and Composer 2.5 is the clearest proof.
Composer 2.5 is the most analytically useful commercial model release of the quarter — not because it claims the largest absolute benchmark numbers, but because it holds a variable constant. Same Kimi K2.5 base. 60 days of proprietary RL. +11/+6/+7 points on three benchmark surfaces. That is the controlled A/B experiment that the open-weight-RL community has been waiting for, and it arrives with enough primary-source disclosure (25x synthetic tasks, targeted textual feedback, Sharded Muon at 0.2-second step time on a 1T-parameter model) to be scrutinized rather than simply cited.
The lock-in tradeoff is the real decision for buyers — per-token economics is the easier half. Composer 2.5 Standard's $0.50/$2.50 rate only matters if your development workflow is already inside Cursor IDE. Teams with IDE-agnostic agent pipelines, multi-vendor routing infrastructure, or non-Cursor toolchain investments are not realizing the cost advantage. The benchmark parity is real; the distribution constraint is equally real. Evaluate both before committing workload.
The forward trajectory is clearer than the current release: Composer 3, trained from scratch on Colossus 2's ~1M H100-equivalents at 10x the compute of 2.5, is the next test of the open-weight-base-plus-proprietary-RL pattern. Composer 2.5 proves the pattern works at one compute scale. Composer 3 will test whether base scaling and post-training scaling can be pushed simultaneously, and whether the result can justify the reported $60B acquisition option SpaceX holds on Cursor.