Reinforcement learning post-training has quietly become the primary scaling axis at every frontier AI lab — and Cursor's September 2025 disclosure for Composer 1.5 is the only case where a lab has written, in plain language, that post-training compute exceeded the compute used to pretrain the base model. That single sourced fact reframes a year of o-series launches, Constitutional AI papers, and GRPO experiments as convergent evidence for the same thesis: the moat has moved.

The stakes are real. Labs that can run large-scale RL post-training loops on proprietary task distributions — coding, math, agent trajectories — are pulling away from labs that compete only on pretraining scale. Base models trained on the same public data are becoming commodities. The differentiation is in the reward signal, the RL algorithm, and the compute cluster that runs the loop.

This guide covers the Cursor Composer 1.5 disclosure, the SpaceX × Cursor infrastructure deal, the imminent Composer 2.5 launch scheduled for tomorrow (May 18), the OpenAI o-series train-time and test-time compute curves, Anthropic's Constitutional AI and RLAIF pipeline driving Opus 4.7's benchmark jump, DeepSeek-R1 and GRPO as the cheaper RL path, and a five-vendor disclosure matrix that no other coverage has compiled in one place.

Key takeaways

01
Cursor Composer 1.5 is the only quantified disclosure.
Cursor wrote explicitly that post-training compute surpassed pretraining compute for Composer 1.5, with a 20x RL scale-up on the same base model. Every other frontier lab gestures at this qualitatively. That asymmetry makes Composer 1.5 the clean anchor fact for the inversion thesis.
02
The SpaceX × Cursor deal is an RL-infrastructure story.
Most coverage frames the $60B option as a Cursor valuation event. The more important read: SpaceX is giving Cursor access to xAI Colossus (~1M H100s as widely reported) specifically because post-training RL is now the compute bottleneck; base-model size is not.
03
OpenAI, Anthropic, and DeepSeek all confirm the shift.
OpenAI’s o1 blog publishes scaling curves showing accuracy improving log-linearly with both train-time RL compute and test-time inference compute. Anthropic’s Constitutional AI is its post-training spine. DeepSeek-R1 demonstrated reasoning emerging from pure RL with GRPO.
04
Opus 4.7’s +6.8-point SWE-Bench Verified gain is post-training-driven.
Anthropic’s Opus 4.7 (April 16, 2026) moved from 80.8% to 87.6% on SWE-Bench Verified and from 53.4% to 64.3% on SWE-Bench Pro. The improvement is attributed to post-training (CAI + RL), not a base-model size increase.
05
The next 12 months will be defined by RL-compute consolidation.
More $60B-class infrastructure deals are likely. The labs that can afford a continuous RL flywheel (proprietary task distributions, reward models, and the compute to run the loop at scale) are pulling ahead of those that can only compete on pretraining.

01 — The Inversion
When post-training compute exceeds pretraining.

The conventional model-development economics ran in one direction: spend most of the compute budget on pretraining a large base model on internet-scale data, then apply a relatively cheap alignment pass (SFT followed by RLHF) to make the model usable. Post-training was the finishing coat, not the structural work.

That ratio has inverted at the frontier. The base model is now a starting point — a well-conditioned weight initialization. The real differentiation happens in what you do to it afterward: what task distribution you train on, what reward signal you design, which RL algorithm you use to optimize against that signal, and how much compute you allocate to the loop. The InstructGPT paper (Ouyang et al., 2022) productionized the SFT → reward model → PPO pipeline. But the compute ratios described in that paper look nothing like what frontier labs are running today.

The signal worth tracking is not which lab has the largest base model — those numbers are converging as efficient architectures spread. The signal is which lab can run the most capable RL post-training loop on the most carefully curated task distribution. That is where the capability gap is opening.

Why this matters now

Base models trained on the same public web corpus are converging in capability. When two labs use similar pretraining data and similar architectures, the post-training pipeline becomes the primary differentiator. Labs without proprietary RL infrastructure are effectively competing on a dimension that is being commoditized under them.

02 — Cursor Composer 1.5
The clean anchor — September 2025.

In September 2025, Cursor published the Composer 1.5 announcement with a disclosure that no other frontier lab had made on a numbered model release. The key sentence, from the Cursor blog: the total compute invested in Composer 1.5's post-training even surpasses the amount used to pretrain the base model — a striking inversion of typical model development economics.

Alongside that qualitative claim, Cursor quantified the RL scale-up: Composer 1.5 was built by scaling reinforcement learning 20× further on the same pretrained model that powered Composer 1. This is the only case in the public record where a frontier coding lab has given both a directional disclosure (post-training > pretraining) and a quantitative multiplier (20×) on a specific model release. Every other lab has discussed the shift in generalities, in research papers, or in earnings calls. Cursor named a number.
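To make the arithmetic behind those two disclosures concrete, here is a minimal sketch. It assumes (our reading, not something Cursor states) that "20× further" means Composer 1.5 used 20 times the RL compute of Composer 1 on the same base model. The function names are hypothetical.

```python
def implied_prior_rl_floor(scale_up: float) -> float:
    """Lower bound on the earlier model's RL compute, as a fraction of pretraining.

    If new_rl = scale_up * old_rl and new_rl > pretrain,
    then old_rl > pretrain / scale_up.
    """
    if scale_up <= 0:
        raise ValueError("scale_up must be positive")
    return 1.0 / scale_up


def share_of_total(post_to_pre_ratio: float) -> float:
    """Convert a post-training / pretraining ratio into a share of total training compute."""
    if post_to_pre_ratio < 0:
        raise ValueError("ratio must be non-negative")
    return post_to_pre_ratio / (1.0 + post_to_pre_ratio)


floor = implied_prior_rl_floor(20.0)
parity_share = share_of_total(1.0)
print(f"Composer 1 RL compute > {floor:.0%} of pretraining (under the stated assumption)")
print(f"Post-training > pretraining means > {parity_share:.0%} of total training compute")
```

Under that reading, the disclosure also implies that Composer 1's own RL run was already more than one twentieth of the pretraining budget, which is far from a thin finishing coat.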

The significance extends beyond Cursor's own product. If a startup with ~600 employees can run a 20× RL scale-up that pushes post-training past pretraining, the implication is that well-targeted RL compute — focused on a specific task distribution like agentic coding — can deliver disproportionate capability gains relative to raw model size. That is a different compute-economics equation than the one the industry has operated on since GPT-3. For more on the Composer 1.5 architecture, see our Composer 1.5 deep dive.

Post-training compute exceeding pretraining is not a theoretical inversion; it is a documented product decision at a company shipping to millions of developers.
Digital Applied synthesis, May 17, 2026

03 — SpaceX × Cursor
A $60B option that is really an RL-infrastructure deal.

On April 21–22, 2026, SpaceX disclosed a $60 billion option to acquire Cursor, with a $10 billion breakup fee, exercisable approximately 30 days after SpaceX's planned IPO (targeted for June 12, 2026, at a $1.75 trillion valuation). The deal was covered by TechCrunch, Fortune, and CNBC.

Most coverage framed the deal as a Cursor valuation story; a startup worth $60 billion is a notable event. The more important read is structural: the partnership gives Cursor access to xAI Colossus's GPU cluster (approximately one million H100s as widely reported) for Composer 2.5's training loop. SpaceX is not optioning a coding tool. It is optioning the RL post-training pipeline that runs on its compute substrate.

This recontextualizes the $60 billion figure. The premium is not for Cursor's current revenue or its IDE distribution. It is, at least in part, a valuation of the proprietary RL training loop and the agentic-coding task distribution that Composer 1.5 and 2 were built on. When compute scale is the differentiator, the company with access to the largest RL cluster and the most targeted task distribution holds the clearest path to the next benchmark jump. That is what SpaceX is optioning.

04 — Cursor Composer 2.5
Launches tomorrow — trained on Colossus.

Cursor's Composer 2.5 is scheduled to launch on May 18, 2026, the day after this post's publication. According to the Cursor blog, vendor-published benchmarks ahead of launch claim 63.2% on CursorBench v3.1 (a vendor-controlled benchmark, so methodology caveats apply) and 79.8% on SWE-Bench Multilingual (vendor-published, not yet third-party verified). Pricing is set at $0.50 input / $2.50 output per million tokens (standard) and $3.00 / $15.00 per million tokens (fast variant).
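For teams budgeting against those prices, a small cost sketch may help. The per-million-token rates are the ones quoted above; the session size is a made-up example, and the function name is hypothetical.

```python
# Per-million-token prices in USD, as quoted in the text above.
PRICES = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def request_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Return the USD cost of a request at the quoted per-million-token rates."""
    price = PRICES[tier]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical agentic session: 2M tokens read, 200k tokens written.
standard_cost = request_cost(2_000_000, 200_000, "standard")
fast_cost = request_cost(2_000_000, 200_000, "fast")
print(f"standard: ${standard_cost:.2f}, fast: ${fast_cost:.2f}")
```

Because both the input and output rates are six times higher on the fast variant, the fast tier costs six times as much for any mix of input and output tokens.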

The Colossus substrate is what makes Composer 2.5 the most direct demonstration of the SpaceX deal's thesis: RL post-training for agentic coding, run at a scale previously available only to the largest AI labs. The Composer line is Cursor's primary product surface for RL research, and Composer 2.5 is the first model in that line trained with access to xAI Colossus compute. How that translates to real-world agentic task performance will be clearer after tomorrow's launch and third-party evaluation. For context on the Cursor product lineup, see our Cursor 3 + Composer review.

05 — OpenAI o-Series
Two scaling curves, not one.

OpenAI's o1 announcement on September 12, 2024 introduced a framing that has since become the industry's reference point. From the o1 post: “We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).”

The post published two scaling plots, each showing accuracy improving roughly log-linearly along its own compute axis. The conventional pretraining axis (parameter count and training tokens) is treated as a third, now-mature dimension. What OpenAI identified is two further scaling axes that can be tuned separately: how much RL compute you spend during training, and how much inference compute you allocate at test time. Snell et al. (August 2024) formalized the test-time dimension around the same time, showing that scaling LLM test-time compute can be more effective than scaling model parameters for certain problem classes.
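"Log-linear" here means accuracy rises by a roughly constant number of points each time compute is multiplied by a constant factor. The sketch below fits that relationship with ordinary least squares. The data points are synthetic and chosen only to illustrate the shape; they are not OpenAI's numbers, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(compute: Sequence[float], accuracy: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of accuracy = intercept + slope * log10(compute)."""
    if len(compute) != len(accuracy) or len(compute) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("compute values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic points: +10 accuracy points per 10x compute (illustrative only).
compute = [1, 10, 100, 1000]
accuracy = [30.0, 40.0, 50.0, 60.0]
slope, intercept = fit_log_linear(compute, accuracy)
print(f"points gained per 10x compute: {slope:.1f}")
```

The practical reading of a log-linear curve is that each additional point of accuracy costs more compute than the last, which is why access to large clusters matters so much on both axes.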

The product surface of the test-time axis is now visible across the industry: OpenAI's GPT-5.2 ships a fifth reasoning effort tier called “xhigh” that allocates maximum test-time compute, Anthropic's visible extended thinking feature exposes the same axis for Claude, and DeepSeek V4's Think Max mode operates on the same principle. See our GPT-5.2 complete guide for the full reasoning-effort breakdown.

Post-training compute share: illustrative trajectory by era

Source: Cursor blog (Sep 2025), OpenAI o1 post (Sep 2024), Ouyang et al. 2022. Non-Cursor ratios are directional estimates, not vendor-disclosed figures.

InstructGPT (Ouyang et al. 2022): ~10%. SFT → reward model → PPO; post-training as finishing coat.
GPT-4 era (2023): ~25%. RLHF at moderate scale; estimated qualitatively.
OpenAI o1 (Sep 2024): ~50%. RL train-time and test-time compute both scale; two independent axes.
Cursor Composer 1.5 (Sep 2025): >100%. Disclosed: post-training compute > pretraining; 20× RL scale-up.
Frontier direction (2026+): expanding. RL flywheel plus proprietary task distributions; consolidation expected.

The chart above is explicit about what is sourced and what is directional. Only the Cursor Composer 1.5 row is a vendor-disclosed figure. The other rows are illustrative placements that convey the qualitative direction of the shift, as supported by research papers and product announcements; they are not vendor-disclosed percentages. We flag this because the dominant fabrication pattern in RL coverage is “confidently-specific made-up percentages” (e.g. “Anthropic spends 70% of compute on RLHF”), a claim that no lab has made publicly.

06 — Anthropic
Constitutional AI — the RLAIF spine driving Opus 4.7.

Anthropic's post-training approach is anchored in Constitutional AI (CAI), introduced in Bai et al. (December 2022). The approach has a model critique and revise its own outputs against a written “constitution” of principles, replacing the human preference-labeling step in the reward modeling pipeline with AI feedback. This is RLAIF (Reinforcement Learning from AI Feedback), and it remains the canonical methodology for Claude's alignment training as of 2026. The CAI paper remains the valid reference, and we do not invent updates that have not been published.
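The two moves described above, self-critique with revision and AI-generated preference labels, can be sketched in a few lines. This is a simplified illustration, not Anthropic's implementation: the prompt templates, the function names, and the stub model are all hypothetical, and a real pipeline would call an actual language model.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # any function mapping a prompt string to a completion


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Tuple[str, str]:
    """Return (initial draft, revised response) after one critique pass per principle."""
    draft = model(prompt)
    revised = draft
    for principle in principles:
        critique = model(
            f"Critique this response against the principle: {principle}\nResponse: {revised}"
        )
        revised = model(
            f"Revise the response to address the critique.\nCritique: {critique}\nResponse: {revised}"
        )
    return draft, revised


def ai_preference_pair(judge: Model, prompt: str, a: str, b: str, principle: str) -> Tuple[str, str]:
    """Ask an AI judge which response better follows the principle; return (chosen, rejected)."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\nAnswer with A or B."
    ).strip().upper()
    return (a, b) if verdict.startswith("A") else (b, a)


def stub_model(text: str) -> str:
    """Stand-in for a real LLM call, used only so the sketch runs end to end."""
    if text.startswith("Critique"):
        return "too terse"
    if text.startswith("Revise"):
        return "a fuller answer"
    if text.startswith("Principle"):
        return "B"
    return "short answer"


draft, revised = critique_and_revise(stub_model, "Explain RLAIF.", ["Be helpful."])
chosen, rejected = ai_preference_pair(stub_model, "Explain RLAIF.", draft, revised, "Be helpful.")
print(chosen, "|", rejected)
```

The point of the structure is that neither function needs a human in the loop: the volume of preference data is bounded by compute, not by labeling budget.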

The product-level evidence for post-training's role comes from Opus 4.7, released on April 16, 2026. According to Anthropic's announcement, Opus 4.7 hits 87.6% on SWE-Bench Verified (up from Opus 4.6's 80.8%) and 64.3% on SWE-Bench Pro (up from 53.4%). That is a 6.8-point Verified improvement and a 10.9-point Pro improvement. Anthropic attributes the gains to post-training — CAI plus RL — not to a base-model size increase. The same announcement notes that Rakuten teams reported Opus 4.7 resolving 3× more production tasks than Opus 4.6, with double-digit gains in code quality (vendor-stated metric).
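Point gains on a benchmark that is already above 80% understate how much of the remaining failure set was closed. The sketch below recomputes the quoted Opus 4.6 to Opus 4.7 figures three ways; the function name is hypothetical and the inputs are the vendor-stated scores from the text.

```python
def benchmark_gain(before: float, after: float) -> dict:
    """Summarize a change in pass rate (both values in percent)."""
    failures_before = 100.0 - before
    failures_after = 100.0 - after
    return {
        "absolute_points": after - before,
        "relative_gain": (after - before) / before,
        "failure_reduction": (failures_before - failures_after) / failures_before,
    }


verified = benchmark_gain(80.8, 87.6)  # SWE-Bench Verified
pro = benchmark_gain(53.4, 64.3)       # SWE-Bench Pro

for name, result in (("Verified", verified), ("Pro", pro)):
    print(
        f"{name}: +{result['absolute_points']:.1f} pts, "
        f"{result['relative_gain']:.1%} relative, "
        f"{result['failure_reduction']:.1%} fewer unresolved tasks"
    )
```

On these inputs the 6.8-point Verified gain corresponds to roughly 35% fewer unresolved tasks, and the 10.9-point Pro gain to roughly 23% fewer.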

The $30 billion Series G Anthropic closed in February 2026 at a $380 billion post-money valuation includes capital allocation for large RL-compute commitments through AWS Bedrock and GCP. Anthropic does not disclose its compute split between pretraining and post-training publicly. What it discloses is the outcome: benchmark improvements driven by the post-training pipeline, and a product roadmap (Claude Code, extended thinking, agents via the Claude Agent SDK) that is entirely post-training-dependent. Read our Opus 4.7 complete guide for the full benchmark breakdown. For the implications for AI transformation programs, see our AI transformation services.

07 — DeepSeek R1
Reasoning from pure RL — and cheaper.

DeepSeek's R1 paper (arXiv, January 22, 2025) demonstrated something the RL-in-LLMs literature had theorized but not clearly shown at scale: reasoning capability can emerge from large-scale RL applied to a pretrained base model, with little or no supervised fine-tuning on reasoning traces. R1-Zero, the purest variant, used GRPO (Group Relative Policy Optimization) with no supervised fine-tuning stage before RL. The model developed chain-of-thought behavior as an emergent property of the RL objective.

GRPO matters because it is cheaper than the standard PPO pipeline. PPO requires a separate critic network running alongside the policy network to compute advantage estimates, doubling the memory and compute overhead. GRPO replaces the critic by sampling multiple outputs per prompt, computing the group-relative reward (how much better is this sample than the average of the group), and using that as the advantage estimate. No separate critic; no value-function training. For a lab operating at DeepSeek's compute budget, this is a meaningful cost reduction per RL training step.
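The group-relative advantage described above is small enough to show directly. This is a minimal sketch of the advantage step only, not a full GRPO trainer: it subtracts the group mean and normalizes by the group's standard deviation, as in the published GRPO formulation. Using the population standard deviation and the small `eps` guard are choices made here for illustration, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sampled output relative to its own group.

    rewards: one scalar reward per output sampled for the same prompt.
    No critic or value network is involved; the group mean is the baseline.
    """
    if len(rewards) < 2:
        raise ValueError("GRPO needs at least two samples per prompt")
    baseline = mean(rewards)
    spread = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Correct samples receive a positive advantage and incorrect ones a negative advantage of equal size, so the policy update pushes probability toward the better half of each group. When all samples score the same, every advantage is zero and the prompt contributes no gradient.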

The broader implication is that frontier-quality RL post-training is not exclusively the domain of labs with unlimited compute. GRPO, along with preference-optimization methods such as DPO (Rafailov et al., 2023; NeurIPS 2023 outstanding paper runner-up) and KTO (Ethayarajh et al., 2024), is lowering the compute floor for competitive post-training. DeepSeek R1 is the clearest proof that the RL axis can be optimized independently of the pretraining parameter count, which makes the compute-efficiency story in the algorithm family map below directly relevant to teams evaluating post-training paths. If the acronym soup here is unfamiliar, our glossary of essential AI agent terms defines GRPO, DPO, RLAIF, and the rest of the post-training vocabulary in plain language.

RLHF / PPO

Classic pipeline

SFT → reward model → PPO

The InstructGPT-era standard. A separate reward model trained on human preferences; PPO optimizes the policy against it. Requires a critic network, high memory overhead. Still used by OpenAI in modified form.

Ouyang et al. 2022

DPO

Direct Preference

Single preference-loss objective

Eliminates the explicit reward model and PPO loop. Trains directly on (chosen, rejected) preference pairs with a closed-form loss. NeurIPS 2023 outstanding paper runner-up. Widely adopted for instruction-following fine-tuning at lower compute.

Rafailov et al. 2023

KTO

Binary-label optimization

Good ∕ bad labels only

Kahneman-Tversky Optimization. Needs only a binary good/bad signal per example — no paired preference data required. Lower data cost than DPO; comparable alignment quality for many tasks.

Ethayarajh et al. 2024

GRPO

Group Relative

Group-sampled advantage, no critic

DeepSeek’s algorithm. Samples multiple outputs per prompt, uses group-relative reward as advantage estimate. Eliminates the critic network — roughly half the memory of PPO at equivalent policy quality. Enabled R1-Zero’s pure-RL reasoning.

DeepSeek-R1 — Jan 2025

RLAIF / CAI

AI feedback

Self-critique + constitutional revision

Anthropic’s Constitutional AI. A base model self-critiques against a written constitution, producing preference data without human labeling. Scales the preference-data generation step without proportional human cost.

Bai et al. 2022 — Anthropic canonical

RLAIF / RLHF hybrid

Mixed signals

Human + AI preference labels

Production-standard at most large labs. Human labels provide high-quality signal on nuanced cases; AI labels scale the volume. Exact ratios are lab-proprietary and not publicly disclosed.

Frontier production standard
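The "closed-form loss" in the DPO entry above is short enough to write out. This sketch computes the loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The default `beta` and the function names are illustrative choices, and a real trainer would compute this over batches of tensors with autograd.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


# A policy identical to the reference has zero margin, so the loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# A policy that raised the chosen answer and lowered the rejected one scores lower.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(f"neutral={neutral:.4f}, improved={improved:.4f}")
```

No reward model and no sampling loop appear anywhere in the computation, which is where the compute savings relative to the RLHF / PPO pipeline come from.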

08 — Disclosure Matrix
What each lab has actually said.

Most coverage of “RL is the new scaling axis” cites the o1 blog or the Composer 1.5 blog in isolation. Below is the first compiled matrix of what all five frontier players have publicly disclosed about their post-training compute, with sources and with explicit flags for what has not been disclosed. No manufactured percentages.

Cursor

Composer line — quantified

Disclosed: post-training compute surpassed pretraining; 20× RL scale-up on Composer 1.5 (Sep 2025). Methodology: RL on agentic-coding harness. Infrastructure: own cluster + xAI Colossus via SpaceX deal (April 2026). Most recent disclosure: Composer 1.5 blog + Composer 2.5 (launching May 18, 2026).

Quantified disclosure

OpenAI

o-series — qualitative curves

Disclosed: o1 publishes RL scaling curves (train-time + test-time both scale log-linearly); no compute split percentage stated. Methodology: RLHF + RL on chain-of-thought traces. Infrastructure: Microsoft Azure + own infra. Most recent: o1 blog, Sep 12, 2024.

Scaling curves, no split

Anthropic

Claude 4.x — CAI as spine

Disclosed: Constitutional AI is canonical post-training methodology (Bai et al. 2022). Opus 4.7 benchmark gains attributed to post-training. No compute split disclosed. Infrastructure: AWS Bedrock + GCP (Series G capital committed). Most recent: Opus 4.7 announcement, April 16, 2026.

Method disclosed, split undisclosed

DeepSeek

R1 + V3-V4 — training curves published

Disclosed: R1 paper publishes RL training curves; R1-Zero trained with pure RL (GRPO) without SFT. DeepSeek V4 uses On-Policy Distillation post-training. Infrastructure: own data center. Most recent: R1 paper, Jan 22, 2025; V4 report, Apr 2026.

Curves + GRPO published

xAI

Grok 4.x — infra disclosed, method undisclosed

Disclosed: Memphis Colossus (~1M H100s as widely reported) is the compute substrate for both xAI training and Cursor’s Composer 2.5 via the SpaceX deal. Post-training methodology: mixed, specifics undisclosed. Most recent: docs.x.ai, reverified May 17, 2026.

Infrastructure disclosed, method opaque

What this matrix shows

Cursor is the only lab with a quantified compute-split disclosure on a numbered model. OpenAI and DeepSeek have published scaling curves without splitting pretraining from post-training budgets. Anthropic has named its methodology without quantifying the budget. xAI has disclosed its infrastructure without disclosing its methodology. If you see an article claiming precise post-training percentages for any lab other than Cursor, treat those numbers as fabricated until a primary source is cited.
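For readers who want to apply that rule mechanically when reviewing other coverage, the matrix reduces to a small lookup. The labels below are copied from the matrix above; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Disclosure:
    lab: str
    status: str             # summary label from the matrix above
    quantified_split: bool  # has the lab quantified post-training vs pretraining compute?


MATRIX = [
    Disclosure("Cursor", "Quantified disclosure", True),
    Disclosure("OpenAI", "Scaling curves, no split", False),
    Disclosure("Anthropic", "Method disclosed, split undisclosed", False),
    Disclosure("DeepSeek", "Curves + GRPO published", False),
    Disclosure("xAI", "Infrastructure disclosed, method opaque", False),
]


def labs_with_quantified_split(matrix: List[Disclosure]) -> List[str]:
    """Labs for which a precise compute-split claim can be traced to a vendor disclosure."""
    return [entry.lab for entry in matrix if entry.quantified_split]


def percentage_claim_needs_source(lab: str, matrix: List[Disclosure] = MATRIX) -> bool:
    """True when a precise post-training percentage for this lab has no disclosure behind it."""
    return lab not in labs_with_quantified_split(matrix)


print(labs_with_quantified_split(MATRIX))
print(percentage_claim_needs_source("Anthropic"))
```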

09 — The New Moat
Post-training data, RL infra, and where compute spend goes next.

The structural implication of the inversion is that the defensible moat in AI has shifted from pretraining data (largely public, largely shared) to post-training assets: the proprietary task distributions you can generate, the reward models you can build for your specific domain, and the RL infrastructure you can run the loop on. Base models trained on the same web corpus are converging. The divergence is happening in post-training.

This is already visible in the market. The $60 billion option on Cursor is an infrastructure bet on RL post-training for agentic coding. Anthropic's $30 billion Series G is, in part, a compute commitment for RL loops via AWS and GCP. OpenAI's Q1 2026 revenue of approximately $6 billion, with Codex as a highlighted growth driver per PYMNTS, reflects a post-training-specialized product generating revenue at scale. Codex's share of that $6 billion is not broken out publicly, so treat this as a directional point, not a precise split.

Looking forward over the next 12 months: the trajectory suggests consolidation around four to five labs that can sustain a continuous RL flywheel — generating proprietary task distributions, training reward models against them, running large-scale RL loops on dedicated infrastructure, and shipping the output as product. Labs outside that group will likely face an accelerating capability gap that cannot be closed by pretraining alone. The economics of RL post-training favor scale and specialization. More $60 billion-class compute partnerships and infrastructure deals are probable. The open-weight path, exemplified by DeepSeek's GRPO work, may persist as a viable alternative for specific domains — but even there, the advantage accrues to labs with the best proprietary reward signal for their target task distribution.

For organizations evaluating AI platform choices, the post-training landscape means that the model you choose today may look very different in 12 months — not because the pretraining data changed, but because the RL loop continued. Choosing a lab is increasingly also choosing its post-training roadmap and compute commitment. For a practical framework on evaluating agentic coding tools across these dimensions, see our Q2 2026 agentic coding tools matrix.

The moat asset

Post-training moat components

Proprietary task distribution, domain-specific reward model, and RL compute infrastructure. Labs with all three can compound capability gains beyond what pretraining alone can deliver.

Replaces: pretraining data scale

Algorithm cost reduction

GRPO vs PPO memory

~2×

GRPO eliminates the critic network required by PPO, roughly halving per-step memory overhead at comparable policy quality. This is the cost lever DeepSeek used to make pure-RL reasoning viable on a constrained budget.

Source: DeepSeek R1 paper

SWE-Bench Verified lift

Opus 4.7 post-training gain

+6.8pts

From 80.8% (Opus 4.6) to 87.6% (Opus 4.7) on SWE-Bench Verified. Anthropic attributes the gain to post-training — CAI plus RL — not a base-model parameter increase.

Source: Anthropic, April 2026

What the inversion means

Post-training is where the frontier moat is being built.

The compute-curve inversion is real and documented. Cursor has the only quantified disclosure. OpenAI, Anthropic, and DeepSeek have each confirmed the shift in their own way — through scaling curves, methodology papers, and benchmark results that cannot be explained by pretraining alone. The SpaceX × Cursor deal converts a valuation story into an infrastructure story: the premium is on the RL loop, not the IDE.

The moat that matters in 2026 is post-training data, domain-specific reward models, and the RL compute infrastructure to run the loop at scale. Base models are converging. The differentiation is in what you do to the base model after pretraining ends. Labs that can sustain a continuous RL flywheel (currently the four to five that can afford it) are pulling ahead on a compounding curve.

Over the next 12 months, expect more infrastructure deals at the Colossus scale, further consolidation of post-training leadership among a small number of labs, and a widening capability gap between those labs and those competing only on pretraining. For teams building on AI foundations today, the strategic question is not which base model is largest — it is which post-training roadmap aligns with your task distribution and how quickly that roadmap compounds. Our AI transformation engagements start with exactly this kind of roadmap evaluation.

Post-Training Is the Moat — RL Compute Wins 2026