Cursor shipped Composer 2.5 earlier today, May 18, 2026. It is built on the same Moonshot Kimi K2.5 open-weight checkpoint that powered Composer 2, with a second-generation reinforcement learning layer on top. Reported benchmarks put it within 0.7 points of Claude Opus 4.7 on SWE-Bench Multilingual at $0.50/$2.50 per million tokens, one-tenth the per-token list price of Opus 4.7's $5/$25 standard tier.

The cost story is straightforward on paper. The governance story is less so: the headline CursorBench v3.1 number (63.2%) is built, run, and scored by Cursor itself — the same entity authoring the model. Disciplined buyers will weight the SWE-Bench Multilingual and Terminal-Bench 2.0 numbers more heavily, since those benchmarks are not Cursor-controlled. And on those third-party surfaces, the Composer 2.5 result is genuinely compelling: 79.8% on Multilingual (0.7 points behind Opus 4.7's 80.5%) and 69.3% on Terminal-Bench (essentially tied with Opus 4.7's 69.4%), at a fraction of the price.

This guide covers what shipped today, the open-weight-base plus proprietary-RL architecture that produced it, the controlled A/B experiment embedded in the Composer 2 to 2.5 transition, the lock-in tradeoff of a Cursor-IDE-only distribution model, and what the SpaceX partnership actually signals about the roadmap — which is Composer 3, not this release.

Key takeaways

01. Same base, 60 days, materially better benchmarks. Composer 2.5 uses the identical Kimi K2.5 checkpoint as Composer 2. The +11.0/+6.1/+7.6 percentage-point gains on CursorBench, SWE-Bench Multilingual, and Terminal-Bench are the product of 25x more synthetic training tasks and a targeted-textual-feedback RL approach, not a new base model.
02. Standard pricing held flat; Fast tier doubled. Standard remains $0.50 in / $2.50 out per Mtok, identical to Composer 2. Fast tier jumped from $1.50/$7.50 to $3.00/$15.00, a 100% increase. Cursor is pricing routing economics into the tiers as Fast becomes the default for interactive sessions and Standard handles background batch work.
03. CursorBench is vendor-controlled; treat accordingly. Cursor builds and scores CursorBench. The 63.2% score for Composer 2.5 outperforms Opus 4.7 at default settings (61.6%), but Opus 4.7 Adaptive reaches 64.8%. Weight SWE-Bench Multilingual and Terminal-Bench 2.0 numbers more heavily, since those benchmarks are not controlled by Cursor.
04. No public API: Cursor IDE only at launch. Composer 2.5 is not exposed via any third-party API surface. Not on OpenRouter, not on Bedrock, not on Vertex. It is available exclusively inside the Cursor IDE. This is a deliberate distribution choice that inverts the standard model-as-commodity pattern.
05. SpaceX means Composer 3, not Composer 2.5. Cursor's own announcement is unambiguously forward-looking: "Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute." Composer 2.5 runs on Cursor's existing RL pipeline. Colossus 2's ~1M H100-equivalents are the infrastructure for the next-generation model.

01 - LAUNCH
What shipped today: pricing, base model, and the 1/10 framing.

Cursor's May 18 release announcement — published as cursor.com/blog/composer-2-5 — describes Composer 2.5 as "frontier-level at coding" at $0.50/M input and $2.50/M output tokens, calling it "a new, optimal combination of intelligence and cost." Two tiers ship simultaneously: Standard at $0.50/$2.50 and Fast at $3.00/$15.00. A launch-week promotion doubles usage allocations for Cursor subscribers for the first seven days.

The 1/10 cost framing is precise on per-token list price. Opus 4.7 sits at $5.00 in / $25.00 out per Mtok (1M context, flat pricing per Anthropic's April 16 announcement). Composer 2.5 Standard is exactly 10x cheaper on both input and output. GPT-5.5 sits at $5.00 in / $30.00 out per Mtok below 272K tokens — with a 2x input surcharge above that threshold — making Composer 2.5 Standard 10x cheaper on input and 12x cheaper on output against the GPT-5.5 list rate.
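The ratios above are simple division on list prices, and a few lines of Python make them easy to re-check if prices change. This is a minimal sketch: the dictionary keys are hypothetical labels chosen for this example, not official model identifiers, and the GPT-5.5 row uses the rate quoted for prompts below 272K tokens.

```python
# List prices in USD per million tokens (input, output), as quoted in this article.
# The keys are illustrative labels, not official API model names.
PRICES = {
    "composer-2.5-standard": (0.50, 2.50),
    "opus-4.7": (5.00, 25.00),
    "gpt-5.5": (5.00, 30.00),  # rate below the 272K-token threshold
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input_ratio, output_ratio) between two models' list prices."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


for model in ("opus-4.7", "gpt-5.5"):
    in_ratio, out_ratio = price_ratio(model, "composer-2.5-standard")
    print(f"{model}: {in_ratio:.0f}x input, {out_ratio:.0f}x output")
```

The ratios compare list prices only. They say nothing about how many tokens each model needs to finish a given task, which is what determines actual per-task cost.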

The base model is disclosed in the announcement: Moonshot AI's Kimi K2.5, a 1.04-trillion-parameter Mixture-of-Experts model (32B activated per token) released January 29, 2026 under a Modified MIT license. This is the same open-weight checkpoint that powered Composer 2 — a detail that becomes important when interpreting the benchmark deltas in Section 05.

02 - ARCHITECTURE
Open-weight base plus proprietary reinforcement learning.

The Composer line embodies what is becoming the dominant frontier-coding architecture: take a permissively licensed open-weight base model (in this case Kimi K2.5) and layer proprietary post-training on top. Cursor has disclosed more about this pipeline with each release. With Composer 1.5, the "20x RL scale-up" disclosure established that the bulk of differentiation sits in post-training, not pre-training. DataCamp's analysis of the Composer 2.5 announcement estimates that roughly 85% of total compute for the final model comes from Cursor's post-training work. That is a striking ratio if accurate, though the specific figure is DataCamp's interpretation rather than a direct Cursor disclosure.

What Cursor does disclose directly for Composer 2.5: the model is trained with "targeted textual feedback" — short text hints inserted at exact decision points where the model went off-trajectory during RL rollouts, rather than scoring only the final outcome. This addresses a fundamental challenge Cursor identifies in their announcement: "credit assignment during RL is becoming an increasingly difficult challenge as rollouts can span hundreds of thousands of tokens." Scoring a final diff is easy; knowing which of 200 intermediate steps caused the failure is not.
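To make the credit-assignment contrast concrete, here is a toy sketch of the difference between a single outcome score and a hint attached at the step that went wrong. Everything in it is an assumption for illustration: the `Step` structure, the per-step `on_track` verdict, and both functions are hypothetical, and Cursor has not disclosed how it detects the decision point or how the hint enters the training objective.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str      # what the agent did at this point in the rollout
    on_track: bool   # verdict from a hypothetical per-step checker
    hint: str = ""   # textual feedback attached to this step, if any


def outcome_only_score(steps: list[Step]) -> float:
    """One scalar for the whole rollout: it says that it failed, not where."""
    return 1.0 if all(s.on_track for s in steps) else 0.0


def attach_targeted_hint(steps: list[Step], hint: str) -> Optional[int]:
    """Attach a short hint at the first off-trajectory step; return its index."""
    for index, step in enumerate(steps):
        if not step.on_track:
            step.hint = hint
            return index
    return None  # nothing went wrong, so there is nothing to annotate


# A made-up three-step rollout in which the second step is the mistake.
rollout = [
    Step("read failing test", on_track=True),
    Step("edit unrelated config file", on_track=False),
    Step("rerun tests", on_track=True),
]

print(outcome_only_score(rollout))
bad_index = attach_targeted_hint(rollout, "The failure is in the parser, not the config.")
print(bad_index, rollout[bad_index].hint)
```

The outcome score collapses the rollout to one number, while the targeted version localizes feedback to a single step. On a rollout spanning hundreds of thousands of tokens, that localization is the part the announcement describes as hard.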

The training infrastructure uses a Sharded Muon optimizer with dual mesh HSDP (hybrid sharded data parallelism), achieving a 0.2-second step time on the ~1T-parameter model. Muon is the same optimizer family Moonshot used to pretrain the original Kimi K2. Cursor's adoption of it for post-training at this scale is a meaningful data point for the "post-training is the new moat" thesis.

Architecture pattern

The open-weight base plus proprietary RL split is now the dominant commercial template for frontier coding models. Cursor provides the clearest public evidence of the pattern: the same Kimi K2.5 base, applied twice, 60 days apart, with the only variable being the depth and methodology of Cursor's post-training. The base model is commoditized input; the RL layer is the proprietary IP. For a deeper look at the base, see our Kimi K2.5 architecture deep dive.

03 - BENCHMARKS
CursorBench, SWE-Bench Multilingual, and Terminal-Bench numbers.

Three benchmark surfaces are reported in the Composer 2.5 announcement and cross-referenced in DataCamp's independent coverage. Section 08 addresses the governance flags in detail. The chart below shows Composer 2.5 alongside Opus 4.7 (at its best reported settings) and GPT-5.5 (at its best reported settings) on each surface. All percentages are as reported by the relevant vendor; independent third-party reproduction had not been completed at the May 18 publish date, per The New Stack's coverage.

The critical benchmark-variant discipline: Opus 4.7's reported 87.6% on SWE-Bench Verified is a different evaluation surface from the 80.5% on SWE-Bench Multilingual cited here. Cross-variant comparison is methodological malpractice. Composer 2.5's 79.8% is measured on Multilingual — not Verified. For the full taxonomy, see our SWE-Bench vs Terminal-Bench methodology guide.

Benchmark comparison — Composer 2.5 vs Opus 4.7 vs GPT-5.5

Sources: cursor.com/blog/composer-2-5; datacamp.com/blog/composer-2-5; anthropic.com/news/claude-opus-4-7

CursorBench v3.1 (⚠ vendor-controlled: Cursor builds and scores this benchmark; competitor rows are Cursor-run measurements at max reasoning)
  Composer 2.5: 63.2%
  Opus 4.7 Adaptive: 64.8%
  GPT-5.5 xhigh: 64.3%

SWE-Bench Multilingual (a different surface from SWE-Bench Verified)
  Composer 2.5: 79.8% (vendor-reported)
  Opus 4.7: 80.5% (Anthropic-reported; 0.7 pts ahead of Composer 2.5)
  GPT-5.5: 77.8% (OpenAI-reported; 2.0 pts behind Composer 2.5)

Terminal-Bench 2.0 (Stanford + Laude Institute benchmark)
  Composer 2.5: 69.3% (near-ties Opus 4.7)
  Opus 4.7: 69.4% (Anthropic-reported; 0.1 pts ahead of Composer 2.5)
  GPT-5.5: 82.7% (OpenAI-reported; significantly ahead on terminal/shell tasks)

The pattern across three surfaces is consistent: Composer 2.5 sits within 0.7 points of Opus 4.7 on SWE-Bench Multilingual, ties Opus 4.7 within 0.1 points on Terminal-Bench 2.0, and leads Opus 4.7 at default settings on CursorBench (63.2% vs 61.6%), though Opus 4.7 Adaptive edges ahead at 64.8%. GPT-5.5 is the clear leader on Terminal-Bench at 82.7%, a 13.4-point gap that matters for shell-automation and infrastructure-as-code workloads.
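The gaps quoted above can be recomputed directly from the reported scores. The sketch below uses only numbers from this article, at each competitor's best reported settings; the `gap` helper is a name introduced for this example.

```python
# Scores in percent, as reported in this article (best reported settings).
SCORES = {
    "CursorBench v3.1": {"Composer 2.5": 63.2, "Opus 4.7": 64.8, "GPT-5.5": 64.3},
    "SWE-Bench Multilingual": {"Composer 2.5": 79.8, "Opus 4.7": 80.5, "GPT-5.5": 77.8},
    "Terminal-Bench 2.0": {"Composer 2.5": 69.3, "Opus 4.7": 69.4, "GPT-5.5": 82.7},
}


def gap(benchmark: str, model: str, baseline: str = "Composer 2.5") -> float:
    """Percentage-point lead of `model` over `baseline` (negative means behind)."""
    row = SCORES[benchmark]
    # Round to one decimal to hide binary floating-point noise in the subtraction.
    return round(row[model] - row[baseline], 1)


for benchmark in SCORES:
    print(benchmark, {m: gap(benchmark, m) for m in ("Opus 4.7", "GPT-5.5")})
```

A positive value means the competitor leads Composer 2.5 on that surface. The only negative entry is GPT-5.5 on SWE-Bench Multilingual, where Composer 2.5 is ahead by 2.0 points.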

The actionable interpretation is not "Composer 2.5 ties Opus 4.7 in absolute capability." It is: at 1/10 the per-token cost, Composer 2.5 can close out roughly the same agentic coding tasks that Opus 4.7 can, within a margin that many production workloads will find acceptable. That is the Pareto-frontier redrawn, not the capability frontier shifted.

04 - PRICING
Standard vs Fast: the routing-economics shift.

The Fast tier's 100% price increase from Composer 2 to 2.5 — from $1.50/$7.50 to $3.00/$15.00 — is the least-covered detail in the launch coverage, and arguably the most important for buyers budgeting at scale. Standard pricing is identical across both generations; the Fast tier doubled while delivering the same intelligence at lower latency.

Independent reviewers at PrimeAICenter describe the routing logic this way: "Fast is for interactive sessions, while Standard is for background work." Cursor is pricing capacity constraints into the Fast tier — interactive demand competes for lower-latency inference infrastructure that costs more to provision. Standard serves background agents and batch pipelines where latency is not the binding constraint.

Standard tier (background agents): $0.50 in / $2.50 out per Mtok. Held flat vs Composer 2.
Best for batch pipelines, background agents, PR review queues, nightly test-generation runs, and any task where latency is not the binding constraint. Same intelligence as Fast at one-sixth the cost.

Fast tier (interactive sessions): $3.00 in / $15.00 out per Mtok. +100% vs Composer 2 Fast.
Best for live pair-programming, interactive Composer sessions, and latency-sensitive code completions. Same intelligence as Standard at 6x the cost; the premium buys lower first-token latency, not more accuracy.

Launch week (first-week promo): double usage allocation, first 7 days only.
Cursor doubles usage allocations for all subscribers during the first week of the Composer 2.5 launch. This effectively halves cost on existing plans, but only through the promotional window.

Opus 4.7 reference (API pricing reference): $5.00 in / $25.00 out per Mtok. 10x more expensive than Composer 2.5 Standard.
Anthropic API list rate for Claude Opus 4.7 with 1M context flat pricing. Composer 2.5 Standard is 10x cheaper on input and output. Opus 4.7 is available via API, Bedrock, and Vertex; Composer 2.5 is Cursor IDE only.

The practical guidance for most teams: default new background-agent workloads to Standard tier. Cursor's own framing — "Standard for background, Fast for interactive" — is reasonable guidance. For teams currently paying Opus 4.7 API rates for background batch jobs that run inside Cursor IDE, the routing shift to Composer 2.5 Standard can materially reduce per-task cost with minimal capability tradeoff.
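A rough per-task estimate shows how the tier choice plays out in dollars. This is a sketch under stated assumptions: the token counts are a made-up example of a background task, the `pick_tier` rule just restates the "Standard for background, Fast for interactive" guidance, and none of the names correspond to a real Cursor API (Composer 2.5 has no public API).

```python
# USD per million tokens (input, output), from the list prices in this section.
RATES = {
    "standard": (0.50, 2.50),
    "fast": (3.00, 15.00),
    "opus-4.7": (5.00, 25.00),  # reference point, not a Composer tier
}


def pick_tier(interactive: bool) -> str:
    """Fast for interactive sessions, Standard for background work."""
    return "fast" if interactive else "standard"


def task_cost(rate_key: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD for one task with the given token counts."""
    in_rate, out_rate = RATES[rate_key]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical background task: 200K input tokens, 20K output tokens.
tier = pick_tier(interactive=False)
for key in (tier, "fast", "opus-4.7"):
    print(f"{key}: ${task_cost(key, 200_000, 20_000):.2f}")
```

For this hypothetical task the estimate comes to $0.15 on Standard, $0.90 on Fast, and $1.50 at Opus 4.7 list rates. The 6x and 10x multiples hold for any token mix because both input and output prices scale by the same factor.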

05 - DELTA
Same base, 60 days, +11.0/+6.1/+7.6 points.

The most significant analytical signal in today's release is not the absolute benchmark position. It is the controlled comparison between Composer 2 (launched March 19, 2026) and Composer 2.5 (launched today, 60 days later). Both use the identical Kimi K2.5 base checkpoint. The only variables are Cursor's post-training methodology and data scale. This is the closest thing to a controlled A/B experiment that has appeared in the public record for the Composer lineage. For background on 2.5's direct predecessor, see our Composer 2 deep dive.

Cursor's own disclosure: "Composer 2.5 is trained with 25x more synthetic tasks than Composer 2." The synthetic-task methodology includes "feature deletion approaches grounded in real codebases" — deleting real features and training the model to reconstruct them, which grounds the synthetic distribution in production patterns rather than toy examples.
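The feature-deletion idea can be illustrated with Python's standard `ast` module: remove one function from a source file, keep the rest as context, and hold the removed code back as the reference. This is a toy sketch of the general technique, not Cursor's pipeline. The task format, the `make_deletion_task` helper, and the sample source are all assumptions for illustration; a production version would presumably verify reconstructions against the repository's tests rather than a stored reference.

```python
import ast

# A tiny stand-in for a real codebase file.
SOURCE = '''
def slugify(title):
    return title.strip().lower().replace(" ", "-")

def title_case(text):
    return text.title()
'''


def make_deletion_task(source: str, function_name: str) -> dict:
    """Delete one top-level function and package the result as a training task."""
    tree = ast.parse(source)
    kept, removed = [], None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            removed = node
        else:
            kept.append(node)
    if removed is None:
        raise ValueError(f"no top-level function named {function_name!r}")
    tree.body = kept
    return {
        "prompt": f"Reimplement the missing function `{function_name}`.",
        "context": ast.unparse(tree),       # the codebase with the feature deleted
        "reference": ast.unparse(removed),  # the original implementation, held out
    }


task = make_deletion_task(SOURCE, "slugify")
print(task["prompt"])
print(task["context"])
```

Because the deleted function comes from real code, the reconstruction target reflects production patterns rather than a hand-written toy problem, which is the grounding property the announcement highlights. Note that `ast.unparse` requires Python 3.9 or later.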

Time elapsed, Composer 2 → Composer 2.5: 60 days (same Kimi K2.5 base throughout)
Composer 2 launched March 19, 2026. Composer 2.5 launched May 18, 2026. Two months of post-training iteration on an identical open-weight base is the cleanest public evidence of post-training as a frontier-shifting axis.

Synthetic data, more tasks than Composer 2: 25× (feature deletion + real codebase grounding)
Cursor's direct disclosure: "Composer 2.5 is trained with 25x more synthetic tasks than Composer 2." Tasks include feature deletion approaches grounded in real codebases. Scale of synthetic data is the dominant lever per DataCamp's analysis (~85% of total compute attributed to Cursor's post-training).

CursorBench gain, 52.2% → 63.2% on CursorBench v3.1: +11.0 pts (⚠ vendor-controlled)
The largest single-version delta in the Composer line's public history on any benchmark surface. Cursor builds and scores CursorBench, so the 11-point gain is directionally significant but not independently verified.

Multilingual gain, 73.7% → 79.8% on SWE-Bench Multilingual: +6.1 pts (third-party benchmark, more weight)
The SWE-Bench Multilingual gain carries more weight than CursorBench because this benchmark is not controlled by Cursor. +6.1 percentage points on a third-party surface in 60 days, with a constant base model, is the strongest evidence available for the post-training-as-scaling-axis thesis.

To complete the controlled comparison: Terminal-Bench 2.0 improved by 7.6 points (61.7% to 69.3%) over the same 60-day window, with the same base model. All three benchmark surfaces moved in the same direction by 6-11 points. Holding the base constant is what makes this analytically useful: it isolates post-training methodology as the variable and rules out base-model improvement as an explanation.
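The deltas and the 60-day window can be checked with a few lines of arithmetic. The sketch below uses only the before/after scores and launch dates stated in this section; the variable names are chosen for this example.

```python
from datetime import date

# (Composer 2 score, Composer 2.5 score) in percent, as reported in this section.
BEFORE_AFTER = {
    "CursorBench v3.1": (52.2, 63.2),
    "SWE-Bench Multilingual": (73.7, 79.8),
    "Terminal-Bench 2.0": (61.7, 69.3),
}

# Percentage-point gain per benchmark, rounded to hide floating-point noise.
deltas = {
    name: round(after - before, 1)
    for name, (before, after) in BEFORE_AFTER.items()
}

# Days between the Composer 2 and Composer 2.5 launch dates.
days_between = (date(2026, 5, 18) - date(2026, 3, 19)).days

for name, delta in deltas.items():
    print(f"{name}: +{delta} pts")
print(f"window: {days_between} days")
```

The result is +11.0, +6.1, and +7.6 points over 60 days. Only the vendor-controlled CursorBench gain reaches double digits, which is worth keeping in mind when the three numbers are quoted together.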

The Claude Opus 4.7 complete guide provides the reference point for what Composer 2.5 is now competing against at the frontier level. The 60-day cadence also suggests Cursor has institutionalized a rapid post-training iteration cycle that most closed-model labs do not publicly disclose.

"The same base model, 60 days of post-training iteration, and +6 to +11 percentage points across three benchmark surfaces: this is the clearest public demonstration that post-training compute is a genuine scaling axis."
Digital Applied synthesis, May 18, 2026

06 - LOCK-IN
No public API: Cursor IDE only.

Composer 2.5 is available exclusively inside the Cursor IDE. There is no public model API, no OpenRouter listing, no Bedrock integration, no Vertex deployment. This is a deliberate architectural and commercial choice, not a temporary gap.

The lock-in logic works in both directions. For buyers, it means the 1/10 cost ratio is only accessible if your development workflow runs inside Cursor IDE. A team using Claude Code, VS Code with the Copilot extension, or a custom IDE-agnostic pipeline cannot route to Composer 2.5 Standard at $0.50/$2.50 — they are paying Opus 4.7 API rates at minimum. For teams already inside Cursor, the cost math is straightforward and the switching cost is zero.

For infrastructure teams building agent scaffolding that runs outside the IDE — autonomous repo agents, CI/CD-triggered code review, nightly test-generation pipelines that don't run in an interactive IDE session — Composer 2.5 is not currently an option. This is the most significant capability constraint in the product. Our AI transformation engagement regularly evaluates exactly this tradeoff when teams are selecting coding model infrastructure.

Cursor Composer 2.5: Cursor IDE, exclusive distribution (Cursor IDE only)
Available only inside the Cursor IDE. Standard ($0.50/$2.50) and Fast ($3.00/$15.00) tiers. No public API, no Bedrock, no Vertex, no OpenRouter. The cheapest frontier coding model on the market, but only if your workflow lives in Cursor.

Claude Opus 4.7: full API + IDE integrations (API + Bedrock + Vertex)
Available via Anthropic API, Amazon Bedrock, Google Vertex AI, and through Cursor IDE (as a BYOK option). $5.00 in / $25.00 out per Mtok with 1M context flat pricing. Maximum distribution flexibility: works in any IDE, any agent scaffold, any custom pipeline.

GPT-5.5: OpenAI API + Foundry Local (API + Azure + Foundry Local)
Available via OpenAI API and Microsoft Azure Foundry. $5.00 in / $30.00 out per Mtok below 272K tokens (2x input surcharge above). Also available through GitHub Copilot integration. Terminal-Bench leader at 82.7%. On-premises deployment now available via Dell AI Factory partnership (May 18, 2026).

The choice matrix above simplifies distribution to a decision between three models. In practice, most enterprise teams will run all three — routing by task type and cost profile. The key constraint is that Composer 2.5 can only participate in IDE-bound workflows at launch. Any agent pipeline running outside the IDE must budget for Opus 4.7 or GPT-5.5 API rates.

07 - ROADMAP
The SpaceX deal is about Composer 3, not Composer 2.5.

Cursor's Composer 2.5 announcement includes a section on the SpaceX partnership that has been widely misread in press coverage. The announcement reads: "Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute. With Colossus 2's million H100-equivalents and our combined data and training techniques, we expect this to be a major leap in model capability."

The verb tense is telling: "we're training" is present continuous, describing work in progress on a model that has not shipped. Composer 2.5 was trained on Cursor's existing RL pipeline. Colossus 2 is infrastructure for a model trained from scratch, and the ~$60B SpaceX acquisition option (reportedly including a $10B breakup fee, per TechCrunch's April 22 reporting) is tied to that same effort: Composer 3, not Composer 2.5.

The scale-up is also qualitatively different. Composer 2.5 is a post-training improvement on an existing open-weight checkpoint. Composer 3 is described as "a significantly larger model" trained "from scratch" with "10x more total compute." That implies a new base model at a larger parameter scale, trained with Cursor's post-training methodology baked in from the start, on infrastructure that dwarfs what any single commercial AI lab has deployed independently. The timeline is undisclosed — Cursor says "we're training" with no committed ship date.

Colossus 2 disambiguation

Composer 2.5 was trained on Cursor's existing pipeline, not on Colossus 2. The SpaceX partnership and Colossus 2's ~1M H100-equivalent cluster are earmarked for Composer 3: a new model trained from scratch at 10x the compute of 2.5. Do not conflate the two. Cursor's own announcement, published today, describes the Colossus 2 work as in progress on a future model. The Cursor × SpaceX training announcement is the primary source.

08 - GOVERNANCE
CursorBench needs the same skepticism we applied to SWE-Bench.

CursorBench v3.1 is built, maintained, and scored by Cursor — the same entity that authors Composer 2.5. This is the most significant governance concern in interpreting today's benchmark claims. Every row in the CursorBench leaderboard — including the Opus 4.7 and GPT-5.5 comparator rows — is evaluated through Cursor's harness, on tasks Cursor selected, with a scoring methodology Cursor controls.

DataCamp's coverage discloses a related asymmetry: "Terminal-Bench and SWE-Bench Multilingual scores for competitors are self-reported from Anthropic and OpenAI respectively." That means the three-surface comparison in Section 03 is a hybrid: Cursor-measured CursorBench scores for all three models, Anthropic-self-reported Multilingual and Terminal-Bench scores for Opus 4.7, and OpenAI-self-reported scores for GPT-5.5. No single surface has been independently reproduced under a unified scaffold across all three models as of the May 18 publish date.

This is not unique to Cursor or Composer 2.5 — it mirrors the governance problems in the broader SWE-Bench ecosystem we analyzed in detail. The appropriate response is not to dismiss the numbers, but to weight them proportionally: third-party benchmarks (Multilingual, Terminal-Bench) carry more weight than vendor-controlled benchmarks (CursorBench) when making procurement decisions.

Governance flag

CursorBench v3.1 is built and run by Cursor. Opus 4.7's 64.8% CursorBench score (Adaptive) exceeds Composer 2.5's 63.2% — a fact Cursor discloses in its own announcement. The "leader per dollar" framing is accurate on per-token economics; "absolute benchmark leader" is not. Weight SWE-Bench Multilingual (79.8% vs 80.5% — 0.7 pts behind Opus 4.7) and Terminal-Bench 2.0 (69.3% vs 69.4% — 0.1 pts behind Opus 4.7) more heavily in any procurement evaluation, as neither benchmark is controlled by Cursor. Independent reproduction on a unified scaffold had not been completed at the May 18 launch date.

09 - ROUTING
Task-routing against Opus 4.7 and GPT-5.5.

Given the benchmark profile and distribution constraints, the practical routing question is: which model for which task, at what cost? The matrix below captures the decision for the three most common agentic coding task classes. All cost figures are based on published list prices as of May 18, 2026.

Composer 2.5 Standard: $0.50 in / $2.50 out per Mtok (Cursor IDE only)
Best for background agents, PR review queues, test generation, multi-file refactors where latency is not the binding constraint, and any Cursor-IDE workflow where cost matters. 10x cheaper than Opus 4.7 for parity-level agentic coding quality on Multilingual and Terminal-Bench.

Claude Opus 4.7: $5.00 in / $25.00 out per Mtok, 1M context flat (API + Bedrock + Vertex)
Best for long-context reasoning, non-IDE agent pipelines, complex backend coherence over very long tasks, Bedrock/Vertex deployment, and any workflow that runs outside Cursor IDE. Marginally ahead on Multilingual (80.5% vs 79.8%) and Terminal-Bench (69.4% vs 69.3%) at 10x the cost.

GPT-5.5 standard: $5.00 in / $30.00 out per Mtok below 272K (Terminal-Bench leader)
Best for terminal and shell automation, infrastructure-as-code tasks, and CLI scaffolding. GPT-5.5 leads Terminal-Bench 2.0 at 82.7% vs 69.3% for Composer 2.5, a 13.4-point gap that matters for shell-heavy workloads. Now also available on-premises via Dell AI Factory (May 18, 2026).

The routing recommendation across all three is straightforward. If the task runs inside Cursor IDE and tolerates ~0.7pt lower accuracy on Multilingual, Composer 2.5 Standard is the economically dominant choice. If it requires terminal/shell mastery, default to GPT-5.5. If it runs outside the IDE, or involves the very-long-context coherence that community reviewers have flagged as a Composer 2.5 weakness, default to Opus 4.7.
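The routing recommendation reduces to a small decision function. This is a sketch of one reasonable reading, not a prescribed policy: the `route` function, its flags, and especially the order in which the conditions are checked are assumptions made for this example, and real routing would also weigh latency, context size, and budget.

```python
def route(runs_in_cursor_ide: bool,
          shell_heavy: bool = False,
          very_long_context: bool = False) -> str:
    """Pick a model for a coding task using the routing guidance in this section."""
    if shell_heavy:
        # GPT-5.5 leads Terminal-Bench 2.0 by 13.4 points.
        return "GPT-5.5"
    if very_long_context or not runs_in_cursor_ide:
        # Composer 2.5 is IDE-only and flagged as weaker on very long contexts.
        return "Claude Opus 4.7"
    # Inside Cursor, with no special requirements, the cheapest option wins.
    return "Composer 2.5 Standard"


print(route(runs_in_cursor_ide=True))
print(route(runs_in_cursor_ide=False))
print(route(runs_in_cursor_ide=True, shell_heavy=True))
```

Checking `shell_heavy` first means a shell-heavy task goes to GPT-5.5 even when it runs inside Cursor. Teams that prioritize cost over the Terminal-Bench gap could reorder the checks.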

10 - OUTLOOK
The open-weight-base plus proprietary-RL pattern as the new commercial template.

Composer 2.5's architecture is not novel in concept: the idea of differentiating through post-training rather than pre-training dates to OpenAI's o1 in September 2024 (itself a closed model, not an open-weight build) and Cursor's own Composer 1.5 nine months later. What Composer 2.5 provides that earlier releases did not is the cleanest controlled evidence: same base, 60 days, +6 to +11 points. Most labs cannot run that A/B because they change the base model between releases. Cursor held the base constant.

The commercial template it establishes is: take a permissively-licensed open-weight MoE (Kimi K2.5 in this case, Modified MIT license), invest the dominant share of compute in task-specific RL post-training, and distribute the result exclusively through a proprietary IDE or toolchain. This inverts the standard model-as-commodity assumption. The base is the commodity; the post-training is the moat; the distribution channel is the lock-in mechanism.

The pattern predicts Q3 and Q4 2026 will see multiple vendors attempt the same three-layer play. The variable that matters is not which open-weight base they choose — those are increasingly interchangeable as Modified MIT and Apache 2.0 licenses proliferate — but how sophisticated their post-training methodology is and how defensible their distribution channel proves to be. Cursor's IDE moat has held so far. Whether it holds as Claude Code, Codex CLI, and Aider compete for the same workflow is the strategic question that today's release does not yet answer. For the broader context on where Cursor sits in that competitive landscape, see our Composer 2 deep dive — 2.5's direct predecessor.

Conclusion

Post-training compute is the axis — and Composer 2.5 is the clearest proof.

Composer 2.5 is the most analytically useful commercial model release of the quarter, not because it claims the largest absolute benchmark numbers, but because it holds a variable constant. Same Kimi K2.5 base. 60 days of proprietary RL. +11.0/+6.1/+7.6 points on three benchmark surfaces. That is the controlled A/B experiment that the open-weight-RL community has been waiting for, and it arrives with enough primary-source disclosure (25x synthetic tasks, targeted textual feedback, Sharded Muon at 0.2-second step time on a 1T-parameter model) to be scrutinized rather than simply cited.

The lock-in tradeoff is the real decision for buyers — per-token economics is the easier half. Composer 2.5 Standard's $0.50/$2.50 rate only matters if your development workflow is already inside Cursor IDE. Teams with IDE-agnostic agent pipelines, multi-vendor routing infrastructure, or non-Cursor toolchain investments are not realizing the cost advantage. The benchmark parity is real; the distribution constraint is equally real. Evaluate both before committing workload.

The forward trajectory is clearer than the current release: Composer 3, trained from scratch on Colossus 2's ~1M H100-equivalents at 10x the compute of 2.5, is the next test of the open-weight-base-plus-proprietary-RL pattern. Composer 2.5 proves the pattern works at one compute scale. Composer 3 will test whether base scaling and post-training scaling can be pushed simultaneously, and whether the result can justify the reported $60B acquisition option SpaceX holds on Cursor.

Composer 2.5: Agent Coding at 1/10 the Cost
